Posted to reviews@spark.apache.org by bomeng <gi...@git.apache.org> on 2016/05/16 20:47:49 UTC

[GitHub] spark pull request: [SPARK-15230] [SQL] distinct() does not handle...

GitHub user bomeng opened a pull request:

    https://github.com/apache/spark/pull/13140

    [SPARK-15230] [SQL] distinct() does not handle column name with dot properly

    ## What changes were proposed in this pull request?
    
    When a table is created with a column name containing a dot, distinct() fails to run. For example:
    ```scala
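    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
    val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
    val df = spark.createDataFrame(rowRDD, schema)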
    ```
    Running the following works without a problem:
    ```scala
    df.select(new Column("`column.with.dot`"))
    ```
    but running the same query with an additional distinct() throws an exception:
    ```scala
    df.select(new Column("`column.with.dot`")).distinct()
    ```
    
    The issue is that distinct() tries to resolve the column name, but the column name in the schema is not wrapped in backticks. The solution is to add the backticks before passing the column name to resolve().
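    
    A minimal sketch of the quoting rule described above, in plain Scala with no Spark dependency. The object `QuoteColumnName` and the helper `quoteIfNeeded` are hypothetical names used only for illustration; the condition mirrors the one shown in the review diff of this pull request and is not necessarily what was finally merged.
    
    ```scala
    // Hypothetical helper, not part of Spark's API: wraps a column name in
    // backticks when it contains a dot and is not already quoted.
    object QuoteColumnName {
      def quoteIfNeeded(name: String): String =
        if (name.contains(".") && !name.contains("`")) s"`$name`" else name

      def main(args: Array[String]): Unit = {
        // A dotted, unquoted name gets wrapped in backticks.
        println(quoteIfNeeded("column.with.dot"))
        // An already quoted name and a plain name are returned unchanged.
        println(quoteIfNeeded("`column.with.dot`"))
        println(quoteIfNeeded("plain"))
      }
    }
    ```
    
    Note that this rule treats every dot as part of the name, so it assumes the caller passes top-level column names rather than nested-field paths.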
    
    ## How was this patch tested?
    
    Added a new test case.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bomeng/spark SPARK-15230

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13140
    
----
commit 2f7ffbd58a3437898f32e7603ca6b603f5fd5088
Author: bomeng <bm...@us.ibm.com>
Date:   2016-05-16T20:37:54Z

    fix distinct()

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #61084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61084/consoleFull)** for PR 13140 at commit [`23a3a50`](https://github.com/apache/spark/commit/23a3a50c79c52bbc0ba440aeade0b3c80b3811b9).



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60429/
    Test PASSed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #61062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61062/consoleFull)** for PR 13140 at commit [`23a3a50`](https://github.com/apache/spark/commit/23a3a50c79c52bbc0ba440aeade0b3c80b3811b9).



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #61045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61045/consoleFull)** for PR 13140 at commit [`0e565db`](https://github.com/apache/spark/commit/0e565db5662bc211dffac2881eb3a66ebfaa689c).



[GitHub] spark pull request: [SPARK-15230] [SQL] distinct() does not handle...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13140#issuecomment-219564436
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #60429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60429/consoleFull)** for PR 13140 at commit [`2f7ffbd`](https://github.com/apache/spark/commit/2f7ffbd58a3437898f32e7603ca6b603f5fd5088).



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #13140: [SPARK-15230] [SQL] distinct() does not handle co...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13140



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    retest this please



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    I do not know what happened to Jenkins; it looks like the failure is unrelated.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61062/
    Test FAILed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #61062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61062/consoleFull)** for PR 13140 at commit [`23a3a50`](https://github.com/apache/spark/commit/23a3a50c79c52bbc0ba440aeade0b3c80b3811b9).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ElementwiseProduct @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class Normalizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class PolynomialExpansion @Since("1.4.0") (@Since("1.4.0") override val uid: String)`



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #60959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60959/consoleFull)** for PR 13140 at commit [`48a6f50`](https://github.com/apache/spark/commit/48a6f50199415ebf436523daaf78bee99b553ea9).



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by bomeng <gi...@git.apache.org>.

Github user bomeng commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    @cloud-fan thanks for your concise code!



[GitHub] spark pull request: [SPARK-15230] [SQL] distinct() does not handle...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13140#issuecomment-219564438
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58656/
    Test PASSed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61084/
    Test PASSed.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60959/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15230] [SQL] distinct() does not handle...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13140#issuecomment-219544289
  
    **[Test build #58656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58656/consoleFull)** for PR 13140 at commit [`2f7ffbd`](https://github.com/apache/spark/commit/2f7ffbd58a3437898f32e7603ca6b603f5fd5088).



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60880/
    Test PASSed.



[GitHub] spark pull request #13140: [SPARK-15230] [SQL] distinct() does not handle co...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13140#discussion_r67605281
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1769,7 +1769,10 @@ class Dataset[T] private[sql](
        * @since 2.0.0
        */
       def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
    -    val groupCols = colNames.map(resolve)
    +    val groupCols = colNames.map {
    +      case column if column.contains(".") && !column.contains("`") => s"`$column`"
    --- End diff --
    
    Actually, we do have a lot of places that try to resolve a column name without parsing it, e.g. `withColumn`, `drop`. We can abstract the logic into a method and call it instead.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #60880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60880/consoleFull)** for PR 13140 at commit [`0c72506`](https://github.com/apache/spark/commit/0c72506fde74a1b60cb0b0a8f8ff2b8549a98fe8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13140: [SPARK-15230] [SQL] distinct() does not handle co...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13140#discussion_r67791329
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1812,7 +1812,13 @@ class Dataset[T] private[sql](
        * @since 2.0.0
        */
       def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
    -    val groupCols = colNames.map(resolve)
    +    val resolver = sparkSession.sessionState.analyzer.resolver
    +    val groupCols = colNames.map {
    --- End diff --
    
    This can be:
    ```
    val allColumns = queryExecution.analyzed.output
    val groupCols = colNames.map { c =>
      allColumns.find(col => resolver(col.name, c)).getOrElse(throw new AnalysisException(...))
    }
    ```
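    
    The suggestion above looks each requested name up among the output attributes by name, without parsing it, so a dot in the name is never interpreted. A self-contained sketch of that lookup in plain Scala follows. `Attr`, `Resolver`, `caseInsensitive` and `resolveAll` are stand-ins invented for this illustration (they are assumptions, not Spark classes), and `IllegalArgumentException` stands in for the elided `AnalysisException(...)` call.
    
    ```scala
    object ResolveByName {
      // Stand-in for a Catalyst attribute; only the name matters here.
      final case class Attr(name: String)

      // Assumed shape of a resolver: decides whether two names refer to the same column.
      type Resolver = (String, String) => Boolean
      val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

      // Match each requested name against the whole attribute name, so dots
      // are compared literally instead of being parsed as separators.
      def resolveAll(output: Seq[Attr], colNames: Seq[String], resolver: Resolver): Seq[Attr] =
        colNames.map { c =>
          output.find(col => resolver(col.name, c)).getOrElse(
            throw new IllegalArgumentException(s"Cannot resolve column name: $c"))
        }

      def main(args: Array[String]): Unit = {
        val output = Seq(Attr("column.with.dot"), Attr("id"))
        // The dotted name is found without any backtick quoting.
        println(resolveAll(output, Seq("COLUMN.WITH.DOT"), caseInsensitive))
      }
    }
    ```
    
    Compared with adding backticks and re-parsing, this kind of lookup does not need to know the quoting rules at all, which may be why it was preferred in review.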



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    LGTM except for a minor comment about the test. Thanks for working on it!



[GitHub] spark pull request #13140: [SPARK-15230] [SQL] distinct() does not handle co...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13140#discussion_r67429987
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1769,7 +1769,10 @@ class Dataset[T] private[sql](
        * @since 2.0.0
        */
       def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
    -    val groupCols = colNames.map(resolve)
    +    val groupCols = colNames.map {
    +      case column if column.contains(".") && !column.contains("`") => s"`$column`"
    --- End diff --
    
    Is there another place where we handle the parsing? If so, then we're duplicating some code here. @cloud-fan



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #61084 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61084/consoleFull)** for PR 13140 at commit [`23a3a50`](https://github.com/apache/spark/commit/23a3a50c79c52bbc0ba440aeade0b3c80b3811b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ElementwiseProduct @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class Normalizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class PolynomialExpansion @Since("1.4.0") (@Since("1.4.0") override val uid: String)`



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #60880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60880/consoleFull)** for PR 13140 at commit [`0c72506`](https://github.com/apache/spark/commit/0c72506fde74a1b60cb0b0a8f8ff2b8549a98fe8).



[GitHub] spark pull request: [SPARK-15230] [SQL] distinct() does not handle...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13140#issuecomment-219564225
  
    **[Test build #58656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58656/consoleFull)** for PR 13140 at commit [`2f7ffbd`](https://github.com/apache/spark/commit/2f7ffbd58a3437898f32e7603ca6b603f5fd5088).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by cloud-fan <gi...@git.apache.org>.

GitHub user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    thanks, merging to master and 2.0!



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #61045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61045/consoleFull)** for PR 13140 at commit [`0e565db`](https://github.com/apache/spark/commit/0e565db5662bc211dffac2881eb3a66ebfaa689c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #13140: [SPARK-15230] [SQL] distinct() does not handle co...

Posted by cloud-fan <gi...@git.apache.org>.

GitHub user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13140#discussion_r67605292
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1769,7 +1769,10 @@ class Dataset[T] private[sql](
        * @since 2.0.0
        */
       def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
    -    val groupCols = colNames.map(resolve)
    +    val groupCols = colNames.map {
    +      case column if column.contains(".") && !column.contains("`") => s"`$column`"
    --- End diff --
    
    BTW, adding backticks and then parsing the name again is not a good approach; please at least copy the code from `withColumn` or `drop`.
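
A minimal sketch of the alternative this comment points at, assuming (the thread does not confirm this) that `withColumn` and `drop` compare the given name literally against the plan's output attributes through a resolver function instead of parsing it. `Attribute`, `Resolver`, `caseInsensitive`, and `resolveLiterally` are hypothetical stand-ins written in plain Scala; they are not Spark APIs.

```scala
object LiteralColumnLookup {
  // Hypothetical stand-in for one output attribute of a query plan.
  final case class Attribute(name: String)

  // Hypothetical resolver: decides whether two column names refer to the same column.
  type Resolver = (String, String) => Boolean

  // Case-insensitive comparison, used here only for illustration.
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Match each requested name against the whole attribute name.
  // The name is never parsed, so a dot is treated as an ordinary character
  // and no backtick quoting is needed.
  def resolveLiterally(
      output: Seq[Attribute],
      colNames: Seq[String],
      resolver: Resolver): Seq[Attribute] =
    colNames.flatMap(name => output.filter(attr => resolver(attr.name, name)))

  def main(args: Array[String]): Unit = {
    val output = Seq(Attribute("column.with.dot"), Attribute("id"))
    val groupCols = resolveLiterally(output, Seq("column.with.dot"), caseInsensitive)
    println(groupCols)
  }
}
```

Because the lookup compares whole names, it also sidesteps the edge case in the diff above, where a name containing both a dot and a backtick would be left unquoted.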



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #13140: [SPARK-15230] [SQL] distinct() does not handle co...

Posted by cloud-fan <gi...@git.apache.org>.

GitHub user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13140#discussion_r67972467
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -1536,4 +1536,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Utils.deleteRecursively(baseDir)
         }
       }
    +
    +  test("SPARK-15230: distinct() does not handle column name with dot properly") {
    +    val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
    +    val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
    +    val df = spark.createDataFrame(rowRDD, schema)
    +
    +    checkAnswer(df.select(new Column("`column.with.dot`")).distinct(), Seq(Row(1), Row(2)))
    --- End diff --
    
    We can simplify this test:
    ```scala
    val df = Seq(1, 1, 2).toDF("column.with.dot")
    checkAnswer(df.distinct, Row(1) :: Row(2) :: Nil)
    ```



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #60959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60959/consoleFull)** for PR 13140 at commit [`48a6f50`](https://github.com/apache/spark/commit/48a6f50199415ebf436523daaf78bee99b553ea9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class Binarizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)`
      * `class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `class CountVectorizerModel(`
      * `class DCT @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `class ElementwiseProduct @Since("2.0.0") (@Since("2.0.0") override val uid: String)`
      * `class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `final class IDF @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class Interaction @Since("1.6.0") (@Since("1.6.0") override val uid: String) extends Transformer`
      * `class MaxAbsScaler @Since("2.0.0") (@Since("2.0.0") override val uid: String)`
      * `class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `class NGram @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `class Normalizer @Since("2.0.0") (@Since("2.0.0") override val uid: String)`
      * `class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer`
      * `class PCA @Since("1.5.0") (`
      * `class PolynomialExpansion @Since("2.0.0") (@Since("2.0.0") override val uid: String)`
      * `final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val uid: String)`
      * `class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `class SQLTransformer @Since("1.6.0") (@Since("1.6.0") override val uid: String) extends Transformer`
      * `class StandardScaler @Since("1.4.0") (`
      * `class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `class StringIndexer @Since("1.4.0") (`
      * `class Tokenizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class RegexTokenizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)`
      * `class VectorIndexer @Since("1.4.0") (`
      * `final class VectorSlicer @Since("1.5.0") (@Since("1.5.0") override val uid: String)`
      * `final class Word2Vec @Since("1.4.0") (`



[GitHub] spark issue #13140: [SPARK-15230] [SQL] distinct() does not handle column na...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13140
  
    **[Test build #60429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60429/consoleFull)** for PR 13140 at commit [`2f7ffbd`](https://github.com/apache/spark/commit/2f7ffbd58a3437898f32e7603ca6b603f5fd5088).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

