You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by kanzhang <gi...@git.apache.org> on 2014/04/19 00:28:30 UTC

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

GitHub user kanzhang opened a pull request:

    https://github.com/apache/spark/pull/448

    [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...

    ... that do not change schema

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kanzhang/spark SPARK-1460

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/448.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #448
    
----
commit 79aeef800d5c1c12acc02ecf5822d9814e792599
Author: Kan Zhang <kz...@apache.org>
Date:   2014-04-18T22:23:26Z

    [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations that do not change schema

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42264403
  
    I don't think it's very obvious :)
    
    My guess would be that this.type is included in a scala specific signature somewhere, and the scala compiler adds casts after all the calls when the method is called on a refined type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42249441
  
    Yeah I agree they look the same.  I guess this.type translates to the the current class and any other specialization for child classes is done by scala magic.
    
    Other than the minor comments about removing unneeded code and avoiding too much scala in the java api I think this is about ready to merge.
    
    Thanks again for investigating this in such detail!  I've personally been bumping into this issue quite a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42253939
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41230336
  
    Hey, just pushed an update on Scala and Java API. Wanted to get some feedback before I move on to Python. Pls pay attention to signatures of filter, intersection and subtract on the Java API. Thx.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42360847
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14731/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40964336
  
    Oh, I see. Thx.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11816864
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    Isn't the base impl lazy also (till compute() is called)? It's kind of hard to tell the difference for users. What's your thinking in keeping the base implementation around (not overriding it)? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40850347
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12347808
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala ---
    @@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag](
     
       override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect()
     
    -  override def persist(newLevel: StorageLevel): EdgeRDD[ED] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
         partitionsRDD.persist(newLevel)
         this
       }
     
    -  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    -  override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY)
    -
    -  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    -  override def cache(): EdgeRDD[ED] = persist()
    --- End diff --
    
    Got it, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40916074
  
    @marmbrus you are right, I can't override randomSplit() due to invariance of Array.
    
    How about cache(), persist(), unpersist()?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12351929
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    @mridulm agree with Patrick, you have to return ```this``` for ```this.type``` return type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42360844
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12349982
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    @mridulm if you look at this patch, it explicitly overrides those for `SchemaRDD`. You can't use `this.type` there because the return type is actually a new RDD class (`FilteredRDD` and so on).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42377607
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42350425
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42385119
  
    @pwendell the build didn't seem to start?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12347791
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    Neat, thanks !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42381666
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41965083
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42366216
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42253947
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42382227
  
    Jenkins, retest this please.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41966458
  
    @marmbrus @mateiz Here's an update on using this.type as return type. See following shell output. The result type for setName() method has changed from org.apache.spark.rdd.RDD[Record] to rdd.type. Similar things happened for SchemaRDD and JavaSchemaRDD (not a subclass). However, object types for rdd, srdd, and jsrdd still print the same. Are these changes of concern? The benefit is we don't have to reimplement those methods in subclasses. 
    
    Also, any tip on quickly checking Scaladoc and Javadoc impact? Thx.
    
    {code}
    scala> rdd.setName("RDD")
    res0: rdd.type = RDD ParallelCollectionRDD[0] at parallelize at <console>:16
    
    scala> rdd
    res1: org.apache.spark.rdd.RDD[Record] = RDD ParallelCollectionRDD[0] at parallelize at <console>:16
    
    scala> srdd.setName("SCHEMA RDD")
    res2: srdd.type = 
    SCHEMA RDD SchemaRDD[2] at RDD at SchemaRDD.scala:98
    == Query Plan ==
    ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
    
    scala> srdd
    res3: org.apache.spark.sql.SchemaRDD = 
    SCHEMA RDD SchemaRDD[2] at RDD at SchemaRDD.scala:98
    == Query Plan ==
    ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
    
    scala> jsrdd.setName("JAVA SCHEMA RDD")
    res4: jsrdd.type = 
    JAVA SCHEMA RDD SchemaRDD[3] at RDD at SchemaRDD.scala:98
    == Query Plan ==
    ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
    
    scala> jsrdd
    res5: org.apache.spark.sql.api.java.JavaSchemaRDD = 
    JAVA SCHEMA RDD SchemaRDD[3] at RDD at SchemaRDD.scala:98
    == Query Plan ==
    ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147
    {code}


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40880763
  
    I agree with leaving union out and adding repartition, coalesce and the other version of distinct. Also these should definitely be added to Java and Python too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42397473
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41971131
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11816456
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    +    applySchema(super.sample(withReplacement, fraction, seed))
    +
    +  override def subtract(other: RDD[Row], p: Partitioner): SchemaRDD =
    +    applySchema(super.subtract(other, p))
    +
    +  override def union(other: RDD[Row]): SchemaRDD =
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40909814
  
    Thanks for your suggestions. I'll update.
    
    Btw, I don't see PythonSchemaRDD in the code base yet, can I leave out Python for now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42397475
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14762/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42248502
  
    @marmbrus my untrained eyes couldn't spot any difference (you may want to verify it yourself just to make sure :). For example,
    
    RDD.unpersist() with RDD[T] return type:
    ```
      public org.apache.spark.rdd.RDD<T> unpersist(boolean);
        flags: ACC_PUBLIC
        Code:
          stack=4, locals=2, args_size=2
             0: aload_0       
             1: new           #208                // class org/apache/spark/rdd/RDD$$anonfun$unpersist$1
             4: dup           
             5: aload_0       
             6: invokespecial #209                // Method org/apache/spark/rdd/RDD$$anonfun$unpersist$1."<init>":(Lorg/apache/spark/rdd/RDD;)V
             9: invokevirtual #211                // Method logInfo:(Lscala/Function0;)V
            12: aload_0       
            13: invokespecial #108                // Method sc:()Lorg/apache/spark/SparkContext;
            16: aload_0       
            17: invokevirtual #213                // Method id:()I
            20: iload_1       
            21: invokevirtual #217                // Method org/apache/spark/SparkContext.unpersistRDD:(IZ)V
            24: aload_0       
            25: getstatic     #157                // Field org/apache/spark/storage/StorageLevel$.MODULE$:Lorg/apache/spark/storage/StorageLevel$;
            28: invokevirtual #160                // Method org/apache/spark/storage/StorageLevel$.NONE:()Lorg/apache/spark/storage/StorageLevel;
            31: invokespecial #186                // Method storageLevel_$eq:(Lorg/apache/spark/storage/StorageLevel;)V
            34: aload_0       
            35: areturn       
          LocalVariableTable:
            Start  Length  Slot  Name   Signature
                   0      36     0  this   Lorg/apache/spark/rdd/RDD;
                   0      36     1 blocking   Z
          LineNumberTable:
            line 167: 0
            line 168: 12
            line 169: 24
            line 170: 34
        Signature: #1819                        // (Z)Lorg/apache/spark/rdd/RDD<TT;>;
    
      public boolean unpersist$default$1();
        flags: ACC_PUBLIC
        Code:
          stack=1, locals=1, args_size=1
             0: iconst_1      
             1: ireturn       
          LocalVariableTable:
            Start  Length  Slot  Name   Signature
                   0       2     0  this   Lorg/apache/spark/rdd/RDD;
          LineNumberTable:
            line 166: 0
    ```
    
    RDD.unpersist() with this.type return type:
    ```
      public org.apache.spark.rdd.RDD<T> unpersist(boolean);
        flags: ACC_PUBLIC
        Code:
          stack=4, locals=2, args_size=2
             0: aload_0       
             1: new           #208                // class org/apache/spark/rdd/RDD$$anonfun$unpersist$1
             4: dup           
             5: aload_0       
             6: invokespecial #209                // Method org/apache/spark/rdd/RDD$$anonfun$unpersist$1."<init>":(Lorg/apache/spark/rdd/RDD;)V
             9: invokevirtual #211                // Method logInfo:(Lscala/Function0;)V
            12: aload_0       
            13: invokespecial #108                // Method sc:()Lorg/apache/spark/SparkContext;
            16: aload_0       
            17: invokevirtual #213                // Method id:()I
            20: iload_1       
            21: invokevirtual #217                // Method org/apache/spark/SparkContext.unpersistRDD:(IZ)V
            24: aload_0       
            25: getstatic     #157                // Field org/apache/spark/storage/StorageLevel$.MODULE$:Lorg/apache/spark/storage/StorageLevel$;
            28: invokevirtual #160                // Method org/apache/spark/storage/StorageLevel$.NONE:()Lorg/apache/spark/storage/StorageLevel;
            31: invokespecial #186                // Method storageLevel_$eq:(Lorg/apache/spark/storage/StorageLevel;)V
            34: aload_0       
            35: areturn       
          LocalVariableTable:
            Start  Length  Slot  Name   Signature
                   0      36     0  this   Lorg/apache/spark/rdd/RDD;
                   0      36     1 blocking   Z
          LineNumberTable:
            line 172: 0
            line 173: 12
            line 174: 24
            line 175: 34
        Signature: #1705                        // (Z)Lorg/apache/spark/rdd/RDD<TT;>;
    
      public boolean unpersist$default$1();
        flags: ACC_PUBLIC
        Code:
          stack=1, locals=1, args_size=1
             0: iconst_1      
             1: ireturn       
          LocalVariableTable:
            Start  Length  Slot  Name   Signature
                   0       2     0  this   Lorg/apache/spark/rdd/RDD;
          LineNumberTable:
            line 171: 0
    ```
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12303586
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
    @@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
       override protected def getPreferredLocations(s: Partition): Seq[String] =
         partitionsRDD.preferredLocations(s)
     
    -  override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
    --- End diff --
    
    @rxin can we just remove these functions if RDD returns this.type instead?  It looks like you were only overloading them for the covariant return type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42115311
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42350963
  
    @pwendell did a rebase just now and pushed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11800221
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    Ah, I see the execution eventually calls the base method, so it is equivalent?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11800178
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    Yes, I noticed it. It has a different signature and doesn't override the base sample() method. I thought this was done on purpose - you wanted to keep both implementations, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42393027
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12023710
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
    @@ -45,4 +50,146 @@ class JavaSchemaRDD(
       override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
     
       val rdd = baseSchemaRDD.map(new Row(_))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    +  def cache(): JavaSchemaRDD = {
    +    baseSchemaRDD.cache()
    +    this
    +  }
    +
    +  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    +  def persist(): JavaSchemaRDD = {
    +    baseSchemaRDD.persist()
    +    this
    +  }
    +
    +  /**
    +   * Set this RDD's storage level to persist its values across operations after the first time
    +   * it is computed. This can only be used to assign a new storage level if the RDD does not
    +   * have a storage level set yet..
    +   */
    +  def persist(newLevel: StorageLevel): JavaSchemaRDD = {
    +    baseSchemaRDD.persist(newLevel)
    +    this
    +  }
    +
    +  /**
    +   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
    +   *
    +   * @param blocking Whether to block until all blocks are deleted.
    +   * @return This RDD.
    +   */
    +  def unpersist(blocking: Boolean = true): JavaSchemaRDD = {
    +    baseSchemaRDD.unpersist(blocking)
    +    this
    +  }
    +
    +  /** Assign a name to this RDD */
    +  def setName(name: String): JavaSchemaRDD = {
    +    baseSchemaRDD.setName(name)
    +    this
    +  }
    +
    +  // Transformations (return a new RDD)
    +
    +  /**
    +   * Return a new RDD that is reduced into `numPartitions` partitions.
    +   */
    +  def coalesce(numPartitions: Int, shuffle: Boolean = false): JavaSchemaRDD =
    +    baseSchemaRDD.coalesce(numPartitions, shuffle)
    +
    +  /**
    +   * Return a new RDD containing the distinct elements in this RDD.
    +   */
    +  def distinct(): JavaSchemaRDD = baseSchemaRDD.distinct()
    +
    +  /**
    +   * Return a new RDD containing the distinct elements in this RDD.
    +   */
    +  def distinct(numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.distinct(numPartitions)
    +
    +  /**
    +   * Return a new RDD containing only the elements that satisfy a predicate.
    +   */
    +  def filter(f: JFunction[Row, java.lang.Boolean]): JavaSchemaRDD =
    +    baseSchemaRDD.filter(x => f.call(new Row(x)).booleanValue())
    +
    +  /**
    +   * Return the intersection of this RDD and another one. The output will not contain any
    +   * duplicate elements, even if the input RDDs did.
    +   *
    +   * Note that this method performs a shuffle internally.
    +   */
    +  def intersection(other: JavaSchemaRDD): JavaSchemaRDD =
    +    baseSchemaRDD.intersection(other)
    +
    +  /**
    +   * Return the intersection of this RDD and another one. The output will not contain any
    +   * duplicate elements, even if the input RDDs did.
    +   *
    +   * Note that this method performs a shuffle internally.
    +   *
    +   * @param partitioner Partitioner to use for the resulting RDD
    +   */
    +  def intersection(other: JavaSchemaRDD, partitioner: Partitioner): JavaSchemaRDD =
    +    baseSchemaRDD.intersection(other, partitioner)
    +
    +  /**
    +   * Return the intersection of this RDD and another one. The output will not contain any
    +   * duplicate elements, even if the input RDDs did.  Performs a hash partition across the cluster
    +   *
    +   * Note that this method performs a shuffle internally.
    +   *
    +   * @param numPartitions How many partitions to use in the resulting RDD
    +   */
    +  def intersection(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.intersection(other, numPartitions)
    +
    +  /**
    +   * Return a new RDD that has exactly `numPartitions` partitions.
    +   *
    +   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
    +   * a shuffle to redistribute data.
    +   *
    +   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
    +   * which can avoid performing a shuffle.
    +   */
    +  def repartition(numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.repartition(numPartitions)
    +
    +  /**
    +   * Return an RDD with the elements from `this` that are not in `other`.
    +   *
    +   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
    +   * RDD will be <= us.
    +   */
    +  def subtract(other: JavaSchemaRDD): JavaSchemaRDD =
    +    baseSchemaRDD.subtract(other)
    +
    +  /**
    +   * Return an RDD with the elements from `this` that are not in `other`.
    +   */
    +  def subtract(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.subtract(other, numPartitions)
    +
    +  /**
    +   * Return an RDD with the elements from `this` that are not in `other`.
    +   */
    +  def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD =
    +    baseSchemaRDD.subtract(other, p)
    +}
    +
    +object JavaSchemaRDD {
    +
    +  implicit def fromSchemaRDD(rdd: SchemaRDD): JavaSchemaRDD =
    --- End diff --
    
    Point taken, will update. Thx


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41230295
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12346651
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    It's to allow child classes to not have to override functions like `persist` and `cache` that are used for chaining:
    
    http://scalada.blogspot.com/2008/02/thistype-for-chaining-method-calls.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11813669
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    +    applySchema(super.sample(withReplacement, fraction, seed))
    +
    +  override def subtract(other: RDD[Row], p: Partitioner): SchemaRDD =
    +    applySchema(super.subtract(other, p))
    +
    +  override def union(other: RDD[Row]): SchemaRDD =
    --- End diff --
    
    don't forget to remove this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12346057
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala ---
    @@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag](
     
       override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect()
     
    -  override def persist(newLevel: StorageLevel): EdgeRDD[ED] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
         partitionsRDD.persist(newLevel)
         this
       }
     
    -  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    -  override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY)
    -
    -  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    -  override def cache(): EdgeRDD[ED] = persist()
    --- End diff --
    
    Why were these removed ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12304083
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
    @@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
       override protected def getPreferredLocations(s: Partition): Seq[String] =
         partitionsRDD.preferredLocations(s)
     
    -  override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
    --- End diff --
    
    Oh right, I guess I was actually looking at the functions `cache` and `persist` below.  Those could be dropped though?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12303651
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
    @@ -45,4 +48,141 @@ class JavaSchemaRDD(
       override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
     
       val rdd = baseSchemaRDD.map(new Row(_))
    +
    +  override def toString: String = baseSchemaRDD.toString
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    +  def cache(): this.type = {
    --- End diff --
    
    It might be safer, clearer to use the explicit type here.  Only for the Java stuff though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42257630
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14687/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41971132
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14611/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41234925
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12021763
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +315,82 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to derived RDD. Typically used to wrap return value
    +   * of base RDD functions that do not change schema.
    +   *
    +   * @param rdd RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(rdd: RDD[Row]): SchemaRDD = {
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, rdd)))
    +  }
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  override def cache(): SchemaRDD = {
    --- End diff --
    
    I should have mentioned this before, but we could consider using `this.type` instead in the Base RDD class for these methods.  I'm not sure if that is breaking API or too much scala magic though. @mateiz ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40854403
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42393186
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40854278
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42365832
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40856764
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14248/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42272515
  
    @kanzhang this is out of date with master - mind merging it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11813558
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    Yeah, you are going to end up getting the same thing.  I'd say we drop this one and leave the other.  Right now it probably doesn't matter, but the other one is lazy and gives the optimizer a chance to possibly improve things before actually executing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11818451
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    Oh, you are right.  The base impl is probably lazy too.  The distinction I was trying to make is that while normal RDD operations are lazy, they are not holistically optimized before execution.  Where as if we create a logical operator and defer the creation of RDDs, there may be some extra chances for optimization (at some point in the future).  We definitely want to override the base impl, but we don't need to have multiple redundant methods for creating samples.
    
    Also note that you might need to sync with the changes being made in #462 .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41452343
  
    Looking pretty good.  Thanks again for working on this!
    
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42115307
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42249374
  
    Actually, let me run with -s option shortly.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12022387
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +315,82 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to derived RDD. Typically used to wrap return value
    +   * of base RDD functions that do not change schema.
    +   *
    +   * @param rdd RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(rdd: RDD[Row]): SchemaRDD = {
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, rdd)))
    +  }
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  override def cache(): SchemaRDD = {
    --- End diff --
    
    I'd be okay trying it, my questions then are what it looks like in Scaladoc and what it looks like in Java. We should also double-check that Scala expects binary compatibility for this kind of return type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42268877
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41968161
  
    PS. Subclasses that override those methods may have to be updated and recompiled (like what I did in EdgeRDD, VertexRDD). Better ideas?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12305806
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
    @@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
       override protected def getPreferredLocations(s: Partition): Seq[String] =
         partitionsRDD.preferredLocations(s)
     
    -  override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
    --- End diff --
    
    Ah, right.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12346614
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/EdgeRDD.scala ---
    @@ -51,18 +51,12 @@ class EdgeRDD[@specialized ED: ClassTag](
     
       override def collect(): Array[Edge[ED]] = this.map(_.copy()).collect()
     
    -  override def persist(newLevel: StorageLevel): EdgeRDD[ED] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
         partitionsRDD.persist(newLevel)
         this
       }
     
    -  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    -  override def persist(): EdgeRDD[ED] = persist(StorageLevel.MEMORY_ONLY)
    -
    -  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    -  override def cache(): EdgeRDD[ED] = persist()
    --- End diff --
    
    @mridulm It's not necessary to overload these anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42224442
  
    @kanzhang, Good point about the public APIs.  I talked with @pwendell and we are a little concerned that this.type will make binary computability hard long term, mostly because we don't understand the bytecode being produced.  Could you possibly run `javap` on some of the RDD classes and let us know what the function signatures look like for the changed methods?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42248461
  
    Yeah, let's look at that. If it's just Object we may be fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42381669
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14746/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42377614
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12304047
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
    @@ -45,4 +48,141 @@ class JavaSchemaRDD(
       override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
     
       val rdd = baseSchemaRDD.map(new Row(_))
    +
    +  override def toString: String = baseSchemaRDD.toString
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    +  def cache(): this.type = {
    --- End diff --
    
    Ah, yes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42257628
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42262939
  
    @marmbrus @kanzhang I'm a bit confused how the bytecode could be totally unchanged.
    
    Let's say I create a new class `FooRDD.scala` in a totally different package and I depend on `RDD` from Spark only via binary (e.g. I'm linking against Spark). How does the compiler know whether `new FooRDD().cache()` should return `FooRDD` or `RDD` if it's not included in the byte-code somewhere?
    
    I think I may be missing something obvious ;)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11819317
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    Thanks for the heads-up. So #462 is already doing it, I'll skip it (I meant overriding it with the query one too).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42115441
  
    Added Python API changes. Skipped filter() as I don't see a way to translate Python functions into Java land. Py4j supports implementing Java interfaces in Python callbacks, but that's not applicable here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12345988
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    I am fairly ignorent of scala; I am not sure I follow, where is type coming from ? And what is it exactly ?
    Also , does this change mean it is an incompat interface change ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40856762
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12023704
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +315,82 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to derived RDD. Typically used to wrap return value
    +   * of base RDD functions that do not change schema.
    +   *
    +   * @param rdd RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(rdd: RDD[Row]): SchemaRDD = {
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, rdd)))
    +  }
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  override def cache(): SchemaRDD = {
    --- End diff --
    
    Thanks for pointing it out, will verify this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42377588
  
    @pwendell thanks for the heads-up. made those changes, let's see how it goes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42264759
  
    @marmbrus I did some local byte code inspection and I think it's injecting this information into the scalasig of the class (in our case, the RDD class). I made a simple class and ran scalap and javap. To java it looks the same but scalap can figure it out. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42371682
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14738/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r11791420
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -313,4 +314,46 @@ class SchemaRDD(
           }
         }
       }
    +
    +  /**
    +   * Creates SchemaRDD by applying own schema to child RDD. Typically used to wrap results of base
    +   * RDD functions that do not change schema.
    +   *
    +   * @param childRDD RDD derived from this one and has same schema
    +   *
    +   * @group schema
    +   */
    +  private def applySchema(childRDD: RDD[Row]): SchemaRDD =
    +    new SchemaRDD(sqlContext, SparkLogicalPlan(ExistingRdd(logicalPlan.output, childRDD)))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  override def coalesce(numPartitions: Int, shuffle: Boolean = false): SchemaRDD =
    +    applySchema(super.coalesce(numPartitions, shuffle))
    +
    +  override def distinct(numPartitions: Int): SchemaRDD =
    +    applySchema(super.distinct(numPartitions))
    +
    +  override def filter(f: Row => Boolean): SchemaRDD =
    +    applySchema(super.filter(f))
    +
    +  override def intersection(other: RDD[Row]): SchemaRDD =
    +    applySchema(super.intersection(other))
    +
    +  override def intersection(other: RDD[Row], partitioner: Partitioner): SchemaRDD =
    +    applySchema(super.intersection(other, partitioner))
    +
    +  override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
    +    applySchema(super.intersection(other, numPartitions))
    +
    +  override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
    --- End diff --
    
    There is already a sample method that returns SchemaRDD above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41230289
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40854395
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42264854
  
    @marmbrus anyways I don't see this tying our hands much in the future, so seems like a good idea. It will potentially force people to re-compile coming from pre-1.0 since the scala type signature has changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42373741
  
    @kanzhang hey you'll need to silence some of the binary compatibility checks in project/MimaBuild.scala:
    
    ```
    excludeSparkClass("org.apache.spark.graphx.VertexRDD")
    excludeSparkClass("org.apache.spark.graphx.EdgeRDD")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41234927
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14414/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12357557
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    Thanks for clarifying, in retrospect that looks obvious !
    On 07-May-2014 2:52 am, "Patrick Wendell" <no...@github.com> wrote:
    
    > In core/src/main/scala/org/apache/spark/rdd/RDD.scala:
    >
    > > @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
    > >     * it is computed. This can only be used to assign a new storage level if the RDD does not
    > >     * have a storage level set yet..
    > >     */
    > > -  def persist(newLevel: StorageLevel): RDD[T] = {
    > > +  def persist(newLevel: StorageLevel): this.type = {
    > >      // TODO: Handle changes of StorageLevel
    >
    > @mridulm <https://github.com/mridulm> if you look at this patch, it
    > explicitly overrides those for SchemaRDD. You can't use this.type there
    > because the return type is actually a new RDD class (FilteredRDD and so
    > on).
    >
    > —
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/448/files#r12349982>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42268837
  
    LGTM then - I looked around a bit and it seems like this use case is exactly what `this.type` is for in Scala.
    
    @marmbrus @mateiz any further feedback?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12021785
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSchemaRDD.scala ---
    @@ -45,4 +50,146 @@ class JavaSchemaRDD(
       override def wrapRDD(rdd: RDD[Row]): JavaRDD[Row] = JavaRDD.fromRDD(rdd)
     
       val rdd = baseSchemaRDD.map(new Row(_))
    +
    +  // =======================================================================
    +  // Base RDD functions that do NOT change schema
    +  // =======================================================================
    +
    +  // Common RDD functions
    +
    +  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    +  def cache(): JavaSchemaRDD = {
    +    baseSchemaRDD.cache()
    +    this
    +  }
    +
    +  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
    +  def persist(): JavaSchemaRDD = {
    +    baseSchemaRDD.persist()
    +    this
    +  }
    +
    +  /**
    +   * Set this RDD's storage level to persist its values across operations after the first time
    +   * it is computed. This can only be used to assign a new storage level if the RDD does not
    +   * have a storage level set yet..
    +   */
    +  def persist(newLevel: StorageLevel): JavaSchemaRDD = {
    +    baseSchemaRDD.persist(newLevel)
    +    this
    +  }
    +
    +  /**
    +   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
    +   *
    +   * @param blocking Whether to block until all blocks are deleted.
    +   * @return This RDD.
    +   */
    +  def unpersist(blocking: Boolean = true): JavaSchemaRDD = {
    +    baseSchemaRDD.unpersist(blocking)
    +    this
    +  }
    +
    +  /** Assign a name to this RDD */
    +  def setName(name: String): JavaSchemaRDD = {
    +    baseSchemaRDD.setName(name)
    +    this
    +  }
    +
    +  // Transformations (return a new RDD)
    +
    +  /**
    +   * Return a new RDD that is reduced into `numPartitions` partitions.
    +   */
    +  def coalesce(numPartitions: Int, shuffle: Boolean = false): JavaSchemaRDD =
    +    baseSchemaRDD.coalesce(numPartitions, shuffle)
    +
    +  /**
    +   * Return a new RDD containing the distinct elements in this RDD.
    +   */
    +  def distinct(): JavaSchemaRDD = baseSchemaRDD.distinct()
    +
    +  /**
    +   * Return a new RDD containing the distinct elements in this RDD.
    +   */
    +  def distinct(numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.distinct(numPartitions)
    +
    +  /**
    +   * Return a new RDD containing only the elements that satisfy a predicate.
    +   */
    +  def filter(f: JFunction[Row, java.lang.Boolean]): JavaSchemaRDD =
    +    baseSchemaRDD.filter(x => f.call(new Row(x)).booleanValue())
    +
    +  /**
    +   * Return the intersection of this RDD and another one. The output will not contain any
    +   * duplicate elements, even if the input RDDs did.
    +   *
    +   * Note that this method performs a shuffle internally.
    +   */
    +  def intersection(other: JavaSchemaRDD): JavaSchemaRDD =
    +    baseSchemaRDD.intersection(other)
    +
    +  /**
    +   * Return the intersection of this RDD and another one. The output will not contain any
    +   * duplicate elements, even if the input RDDs did.
    +   *
    +   * Note that this method performs a shuffle internally.
    +   *
    +   * @param partitioner Partitioner to use for the resulting RDD
    +   */
    +  def intersection(other: JavaSchemaRDD, partitioner: Partitioner): JavaSchemaRDD =
    +    baseSchemaRDD.intersection(other, partitioner)
    +
    +  /**
    +   * Return the intersection of this RDD and another one. The output will not contain any
    +   * duplicate elements, even if the input RDDs did.  Performs a hash partition across the cluster
    +   *
    +   * Note that this method performs a shuffle internally.
    +   *
    +   * @param numPartitions How many partitions to use in the resulting RDD
    +   */
    +  def intersection(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.intersection(other, numPartitions)
    +
    +  /**
    +   * Return a new RDD that has exactly `numPartitions` partitions.
    +   *
    +   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
    +   * a shuffle to redistribute data.
    +   *
    +   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
    +   * which can avoid performing a shuffle.
    +   */
    +  def repartition(numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.repartition(numPartitions)
    +
    +  /**
    +   * Return an RDD with the elements from `this` that are not in `other`.
    +   *
    +   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
    +   * RDD will be <= us.
    +   */
    +  def subtract(other: JavaSchemaRDD): JavaSchemaRDD =
    +    baseSchemaRDD.subtract(other)
    +
    +  /**
    +   * Return an RDD with the elements from `this` that are not in `other`.
    +   */
    +  def subtract(other: JavaSchemaRDD, numPartitions: Int): JavaSchemaRDD =
    +    baseSchemaRDD.subtract(other, numPartitions)
    +
    +  /**
    +   * Return an RDD with the elements from `this` that are not in `other`.
    +   */
    +  def subtract(other: JavaSchemaRDD, p: Partitioner): JavaSchemaRDD =
    +    baseSchemaRDD.subtract(other, p)
    +}
    +
    +object JavaSchemaRDD {
    +
    +  implicit def fromSchemaRDD(rdd: SchemaRDD): JavaSchemaRDD =
    --- End diff --
    
    While I'm generally a fan of scala trickery. I have found massaging type with implicits often comes back to bite you.  You may be safe here, but I'd still prefer we just call a wrap method explicitly. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42350396
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41965097
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12348272
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
        * it is computed. This can only be used to assign a new storage level if the RDD does not
        * have a storage level set yet..
        */
    -  def persist(newLevel: StorageLevel): RDD[T] = {
    +  def persist(newLevel: StorageLevel): this.type = {
         // TODO: Handle changes of StorageLevel
    --- End diff --
    
    So I guess this cant be applied to checkpointRDD and randomSplit ?
    What about things like filter, distinct, repartition, sample, filterWith etc ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42254028
  
    Pushed an update based on @marmbrus comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/448#discussion_r12304013
  
    --- Diff: graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala ---
    @@ -71,18 +71,18 @@ class VertexRDD[@specialized VD: ClassTag](
       override protected def getPreferredLocations(s: Partition): Seq[String] =
         partitionsRDD.preferredLocations(s)
     
    -  override def persist(newLevel: StorageLevel): VertexRDD[VD] = {
    +  override def persist(newLevel: StorageLevel): this.type = {
    --- End diff --
    
    @marmbrus it actually has its own logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/448#issuecomment-40855861

Thanks for doing this!

I think we are actually okay for `intersect` and `subtract` as anything in the result must be a row that was in the original RDD and thus must have a correct schema. If you intersect with a different schema you will get back an empty rdd. If you subtract with an different schema the subtraction will be a no-op and you'll get back the original RDD.

Union is a little more troublesome. We could check the schema and throw an error if they don't match, but that is kinda changing the semantics relative the the standard `union` call on RDD. Also, when we do a SQL union we do type widening, so just calling RDD union and returning a `SchemaRDD` is a little weird.

So, I'd propose we leave union out, as users that want SQL semantics here can already call unionAll. @mateiz might have thoughts here too.

A few other methods we can add that also don't change the schema:
- `distinct()` with no `numPartitions`
- `repartition`
- `setName(...)` ?
- `randomSplit` (not sure if this is okay since `Array` is invariant)

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42371680
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/448


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40949024
  
    > How about cache(), persist(), unpersist()?
    
    Good catch!
    
    > Btw, I don't see PythonSchemaRDD in the code base yet, can I leave out Python for now?
    
    It is just called `SchemaRDD` and is located in `python/pyspark/sql.py`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-41972942
  
    @marmbrus Instead of current JavaSchemaRDD.fromSchemaRDD, I'm tempted to add a toJavaSchemaRDD method in SchemaRDD to allow better method chaining. Thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42116989
  
    Build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42217278
  
    @marmbrus @mateiz Just a heads-up, changing to use this.type as return type may break 1.0 public API if we don't get this change in (see comment above).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42252573
  
    The -s option isn't very interesting. The couple lines added are identical in both cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42266085
  
    @pwendell yes, scalap shows it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42116990
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14634/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42451285
  
    Thanks for updating this. I'm merging it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42393192
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-42366235
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40856180
  
    Also, we should make the same changes to the Java and Python API if possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1460] Returning SchemaRDD instead of no...

Posted by kanzhang <gi...@git.apache.org>.

Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/448#issuecomment-40850230
  
    First try, pls comment. Not very comfortable with methods that take other RDDs, like intersect, subtract and union, since caller has to make sure they of the same schema.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---