You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by rxin <gi...@git.apache.org> on 2015/04/14 08:15:04 UTC

[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/5505

    [SPARK-6865][SQL] DataFrame column names should be treated as string literals

    For example, "a.b" should match a column named `a.b`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark resolve

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5505.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5505
    
----
commit 36f63a45d9971a0092a80642e426eec676428251
Author: Reynold Xin <rx...@databricks.com>
Date:   2015-04-14T06:13:19Z

    [SPARK-6865][SQL] DataFrame column names should be treated as string literals.
    
    For example, "a.b" should match a column named `a.b`.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28305663
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -117,6 +117,15 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
         resolve(name, children.flatMap(_.output), resolver, throwErrors)
     
       /**
    +   * Optionally resolves the given string (`name`) to a [[NamedExpression]] using the input
    +   * from all child nodes of this LogicalPlan. The given string is considered a string literal,
    +   * i.e. the string itself should match the attribute name and the attribute name alone.
    +   */
    +  def resolveQuoted(name: String, resolver: Resolver): Option[NamedExpression] = {
    +    output.find { attribute => resolver(name, attribute.name) }
    --- End diff --
    
    How about throw exception if the `name` matches more than one attributes? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28302683
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -121,7 +121,7 @@ class DataFrameSuite extends QueryTest {
         )
       }
     
    -  test("self join with aliases") {
    +  ignore("self join with aliases") {
    --- End diff --
    
    note @marmbrus our new resolver semantics breaks this test. Not sure how important it is.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28307206
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -117,6 +117,15 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
         resolve(name, children.flatMap(_.output), resolver, throwErrors)
     
       /**
    +   * Optionally resolves the given string (`name`) to a [[NamedExpression]] using the input
    +   * from all child nodes of this LogicalPlan. The given string is considered a string literal,
    +   * i.e. the string itself should match the attribute name and the attribute name alone.
    +   */
    +  def resolveQuoted(name: String, resolver: Resolver): Option[NamedExpression] = {
    +    output.find { attribute => resolver(name, attribute.name) }
    --- End diff --
    
    That's a good point, but I think for that case, the correct behavior should result in a failure during df creation (e.g. during analysis), not when we try to resolve a column later on.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28348897
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -121,7 +121,7 @@ class DataFrameSuite extends QueryTest {
         )
       }
     
    -  test("self join with aliases") {
    +  ignore("self join with aliases") {
    --- End diff --
    
    The more I think about this, the more I am worried that we can't make a change this large.  There is no way to express self join queries if we don't handle `.` in column names.  We are also going to break lots of existing user code...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28349026
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -117,6 +117,15 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
         resolve(name, children.flatMap(_.output), resolver, throwErrors)
     
       /**
    +   * Optionally resolves the given string (`name`) to a [[NamedExpression]] using the input
    +   * from all child nodes of this LogicalPlan. The given string is considered a string literal,
    +   * i.e. the string itself should match the attribute name and the attribute name alone.
    +   */
    +  def resolveQuoted(name: String, resolver: Resolver): Option[NamedExpression] = {
    +    output.find { attribute => resolver(name, attribute.name) }
    --- End diff --
    
    You can't always know if something is going to be ambiguous when you create the DataFrame as it might only be ambiguous at query time due to a setting like case sensitivity.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r29122071
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -688,7 +688,7 @@ class ParquetDataSourceOnSourceSuite extends ParquetSourceSuiteBase {
         sql("DROP TABLE alwaysNullable")
       }
     
    -  test("Aggregation attribute names can't contain special chars \" ,;{}()\\n\\t=\"") {
    +  ignore("Aggregation attribute names can't contain special chars \" ,;{}()\\n\\t=\"") {
    --- End diff --
    
    I guess it should be OK to disable or even remove this test now, since now we check for invalid field names explicitly and suggest users to add aliases. See #5263.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-92635498
  
      [Test build #30221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30221/consoleFull) for   PR 5505 at commit [`36f63a4`](https://github.com/apache/spark/commit/36f63a45d9971a0092a80642e426eec676428251).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28304919
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
    @@ -688,7 +688,7 @@ class ParquetDataSourceOnSourceSuite extends ParquetSourceSuiteBase {
         sql("DROP TABLE alwaysNullable")
       }
     
    -  test("Aggregation attribute names can't contain special chars \" ,;{}()\\n\\t=\"") {
    +  ignore("Aggregation attribute names can't contain special chars \" ,;{}()\\n\\t=\"") {
    --- End diff --
    
    and @liancheng I had to disable this test as well since it used "tablename.columnname".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-92700970
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30227/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28306051
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -117,6 +117,15 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
         resolve(name, children.flatMap(_.output), resolver, throwErrors)
     
       /**
    +   * Optionally resolves the given string (`name`) to a [[NamedExpression]] using the input
    +   * from all child nodes of this LogicalPlan. The given string is considered a string literal,
    +   * i.e. the string itself should match the attribute name and the attribute name alone.
    +   */
    +  def resolveQuoted(name: String, resolver: Resolver): Option[NamedExpression] = {
    +    output.find { attribute => resolver(name, attribute.name) }
    --- End diff --
    
    is that even possible?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin closed the pull request at:

    https://github.com/apache/spark/pull/5505


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5505#discussion_r28306610
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -117,6 +117,15 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
         resolve(name, children.flatMap(_.output), resolver, throwErrors)
     
       /**
    +   * Optionally resolves the given string (`name`) to a [[NamedExpression]] using the input
    +   * from all child nodes of this LogicalPlan. The given string is considered a string literal,
    +   * i.e. the string itself should match the attribute name and the attribute name alone.
    +   */
    +  def resolveQuoted(name: String, resolver: Resolver): Option[NamedExpression] = {
    +    output.find { attribute => resolver(name, attribute.name) }
    --- End diff --
    
    Maybe 2 columns with different names have same alias name by mistake? I'm not sure if `DataFrame` operations will do alias.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-96494320
  
    That one actually doesn't handle most self join cases, since very often in self joins you join on different keys.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-92664731
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30221/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-92671839
  
      [Test build #30227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30227/consoleFull) for   PR 5505 at commit [`a5f6983`](https://github.com/apache/spark/commit/a5f69830d44477cf2aa0df9bfa9c234f723881e3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-92700931
  
      [Test build #30227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30227/consoleFull) for   PR 5505 at commit [`a5f6983`](https://github.com/apache/spark/commit/a5f69830d44477cf2aa0df9bfa9c234f723881e3).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-96492437
  
    As https://github.com/apache/spark/pull/5638 handled self join correctly, should we reopen this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-93182797
  
    I discussed with michael offline -- given this would break self-join, we've decided to treat [dot] (i.e. ".") as a special case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6865][SQL] DataFrame column names shoul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5505#issuecomment-92664670
  
      [Test build #30221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30221/consoleFull) for   PR 5505 at commit [`36f63a4`](https://github.com/apache/spark/commit/36f63a45d9971a0092a80642e426eec676428251).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org