You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2015/12/08 03:15:16 UTC

[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/10184

    [SPARK-12188] [SQL] Code refactoring and comment correction in Dataset APIs

    This PR contains the following updates:
    
    - Created a new private variable `boundTEncoder` that can be shared by multiple functions, `RDD`, `select` and `collect`. 
    - Replaced all the `queryExecution.analyzed` by the function call `logicalPlan`
    - A few API comments are using wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`)
    - A few API descriptions are wrong. (e.g., `mapPartitions`)
    
    @marmbrus @rxin @cloud-fan Could you take a look and check if they are appropriate? Thank you!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark datasetClean

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10184
    
----
commit abe55fffe514a3064a5935ccd63b747769fb5f67
Author: gatorsmile <ga...@gmail.com>
Date:   2015-12-07T04:27:11Z

    Merge remote-tracking branch 'upstream/master' into encoder

commit c615d8f46b2528b1be431c8d29f9f227305dbce0
Author: gatorsmile <ga...@gmail.com>
Date:   2015-12-07T22:31:14Z

    comment updates.

commit e97fa7709d1257314000ff01984123ffba1bf75d
Author: gatorsmile <ga...@gmail.com>
Date:   2015-12-07T23:24:14Z

    comment updates and code cleaning

commit d510d6aed40896dea5e0b60ac09f3dab73d3c857
Author: gatorsmile <ga...@gmail.com>
Date:   2015-12-07T23:55:58Z

    comment style fix.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162752001
  
    **[Test build #47298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47298/consoleFull)** for PR 10184 at commit [`d510d6a`](https://github.com/apache/spark/commit/d510d6aed40896dea5e0b60ac09f3dab73d3c857).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10184#discussion_r47046420
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -429,18 +432,18 @@ class Dataset[T] private[sql](
     
       /**
        * (Java-specific)
    -   * Returns a [[GroupedDataset]] where the data is grouped by the given key function.
    +   * Returns a [[GroupedDataset]] where the data is grouped by the given key `func`.
        * @since 1.6.0
        */
    -  def groupBy[K](f: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
    -    groupBy(f.call(_))(encoder)
    +  def groupBy[K](func: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
    --- End diff --
    
    Sure, next time, I will be careful. Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162752661
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47300/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10184


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162739073
  
    **[Test build #47298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47298/consoleFull)** for PR 10184 at commit [`d510d6a`](https://github.com/apache/spark/commit/d510d6aed40896dea5e0b60ac09f3dab73d3c857).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162752577
  
    **[Test build #47300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47300/consoleFull)** for PR 10184 at commit [`d510d6a`](https://github.com/apache/spark/commit/d510d6aed40896dea5e0b60ac09f3dab73d3c857).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162738003
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10184#discussion_r47046675
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -67,15 +67,21 @@ class Dataset[T] private[sql](
         tEncoder: Encoder[T]) extends Queryable with Serializable {
     
       /**
    -   * An unresolved version of the internal encoder for the type of this dataset.  This one is marked
    -   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
    -   * object type (that will be possibly resolved to a different schema).
    +   * An unresolved version of the internal encoder for the type of this [[Dataset]].  This one is
    +   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
    +   * same object type (that will be possibly resolved to a different schema).
        */
       private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)
     
       /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
       private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
    -    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
    +    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
    +
    +  /**
    +   * The encoder where the expressions used to construct an object from an input row have been
    +   * bound to the ordinals of the given schema.
    --- End diff --
    
    oh, actually i forgot :(


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10184#discussion_r46992054
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -429,18 +432,18 @@ class Dataset[T] private[sql](
     
       /**
        * (Java-specific)
    -   * Returns a [[GroupedDataset]] where the data is grouped by the given key function.
    +   * Returns a [[GroupedDataset]] where the data is grouped by the given key `func`.
        * @since 1.6.0
        */
    -  def groupBy[K](f: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
    -    groupBy(f.call(_))(encoder)
    +  def groupBy[K](func: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
    --- End diff --
    
    A note: this is fine since RC1 failed, but we can't make these kinds of changes in the future as they break compatibility.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162969383
  
    Thanks, I'm going to merge this to master and 1.6.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162752660
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162752149
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47298/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10184#discussion_r47046739
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -67,15 +67,21 @@ class Dataset[T] private[sql](
         tEncoder: Encoder[T]) extends Queryable with Serializable {
     
       /**
    -   * An unresolved version of the internal encoder for the type of this dataset.  This one is marked
    -   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
    -   * object type (that will be possibly resolved to a different schema).
    +   * An unresolved version of the internal encoder for the type of this [[Dataset]].  This one is
    +   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
    +   * same object type (that will be possibly resolved to a different schema).
        */
       private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)
     
       /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
       private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
    -    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
    +    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
    +
    +  /**
    +   * The encoder where the expressions used to construct an object from an input row have been
    +   * bound to the ordinals of the given schema.
    --- End diff --
    
    Let me add the change in a follow-up PR. : ) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162739762
  
    **[Test build #47300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47300/consoleFull)** for PR 10184 at commit [`d510d6a`](https://github.com/apache/spark/commit/d510d6aed40896dea5e0b60ac09f3dab73d3c857).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10184#discussion_r47046431
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -67,15 +67,21 @@ class Dataset[T] private[sql](
         tEncoder: Encoder[T]) extends Queryable with Serializable {
     
       /**
    -   * An unresolved version of the internal encoder for the type of this dataset.  This one is marked
    -   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
    -   * object type (that will be possibly resolved to a different schema).
    +   * An unresolved version of the internal encoder for the type of this [[Dataset]].  This one is
    +   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
    +   * same object type (that will be possibly resolved to a different schema).
        */
       private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)
     
       /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
       private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
    -    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
    +    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
    +
    +  /**
    +   * The encoder where the expressions used to construct an object from an input row have been
    +   * bound to the ordinals of the given schema.
    --- End diff --
    
    I see. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10184#discussion_r46991908
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -67,15 +67,21 @@ class Dataset[T] private[sql](
         tEncoder: Encoder[T]) extends Queryable with Serializable {
     
       /**
    -   * An unresolved version of the internal encoder for the type of this dataset.  This one is marked
    -   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
    -   * object type (that will be possibly resolved to a different schema).
    +   * An unresolved version of the internal encoder for the type of this [[Dataset]].  This one is
    +   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
    +   * same object type (that will be possibly resolved to a different schema).
        */
       private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)
     
       /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
       private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
    -    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
    +    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
    +
    +  /**
    +   * The encoder where the expressions used to construct an object from an input row have been
    +   * bound to the ordinals of the given schema.
    --- End diff --
    
    Nit: I'm going to change this to say `this [[Dataset]]'s output schema`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10184#issuecomment-162752145
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org