You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by marmbrus <gi...@git.apache.org> on 2014/08/30 00:14:27 UTC

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/2209

    [SPARK-2890][SQL] Allow reading of data when case insensitive resolution could cause possible ambiguity.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark sameNameStruct

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2209
    
----
commit a703ff41743c566fee25c7121dbd077c6a52b021
Author: Michael Armbrust <mi...@databricks.com>
Date:   2014-08-29T22:05:04Z

    Allow reading of data when case insensitive resolution could cause possible ambiguity.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55201477
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/40/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch **does not** merge cleanly!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55509238
  
    Jenkins will actually show you how long the tests took, which can be helpful in narrowing down why we're seeing these timeouts.  In this case, it looks like the majority of the time is spent in certain Hive compatibility tests:
    
    https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/90/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-54695039
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19848/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-53936431
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19498/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55506022
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/90/consoleFull) for   PR 2209 at commit [`729cca4`](https://github.com/apache/spark/commit/729cca4a038ab6dfbd538edc1eb24f78cdb52a7d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55510444
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/99/consoleFull) for   PR 2209 at commit [`729cca4`](https://github.com/apache/spark/commit/729cca4a038ab6dfbd538edc1eb24f78cdb52a7d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2209


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-53975089
  
    I actually encountered the error with a jsonRDD, but yeah it could happen with parquet files as well.  Your comment about joins though makes me think that we should just get rid of this check entirely.  We can throw an error when your query is invalid, but throwing an exception just because at some point in a query something _could_ be ambiguous seems overly restrictive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-53975391
  
    Sounds good. I was not sure how to correctly query those results with ambiguous schemas when I added that check. Seems an more informative logging entry is better than an exception. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55509181
  
    **[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/90/consoleFull)** after     a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-53940609
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19498/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55206184
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/40/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch **fails** unit tests.
     * This patch **does not** merge cleanly!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2209#discussion_r16927258
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -158,11 +159,18 @@ case class Subquery(alias: String, child: LogicalPlan) extends UnaryNode {
      * this allows for optional case insensitive attribute resolution.  This node can be elided after
      * analysis.
      */
    -case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode {
    +case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging {
       protected def lowerCaseSchema(dataType: DataType): DataType = dataType match {
         case StructType(fields) =>
    -      StructType(fields.map(f =>
    -        StructField(f.name.toLowerCase(), lowerCaseSchema(f.dataType), f.nullable)))
    +      val convertedFields = fields.map(f =>
    +        StructField(f.name.toLowerCase, lowerCaseSchema(f.dataType), f.nullable))
    +      val deduplicatedFields = convertedFields.groupBy(_.name).map {
    +        case (fieldName, versions) if versions.size == 1  => versions.head
    +        case (fieldName, versions) if versions.size > 1  =>
    +          logWarning(s"Resolving attributes case insensitively is ambiguous for $fieldName")
    --- End diff --
    
    Provide more information on which column (with the original column name) we will keep in the `lowerCaseSchema`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-53945691
  
    Reading parquet files in `HiveContext` triggers the problem? If we have two columns `c1, C1`, we will not be able to read `C1` when we are using case insensitive resolution, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2209#discussion_r17380899
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala ---
    @@ -158,11 +159,18 @@ case class Subquery(alias: String, child: LogicalPlan) extends UnaryNode {
      * this allows for optional case insensitive attribute resolution.  This node can be elided after
      * analysis.
      */
    -case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode {
    +case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging {
       protected def lowerCaseSchema(dataType: DataType): DataType = dataType match {
         case StructType(fields) =>
    -      StructType(fields.map(f =>
    -        StructField(f.name.toLowerCase(), lowerCaseSchema(f.dataType), f.nullable)))
    +      val convertedFields = fields.map(f =>
    +        StructField(f.name.toLowerCase, lowerCaseSchema(f.dataType), f.nullable))
    +      val deduplicatedFields = convertedFields.groupBy(_.name).map {
    --- End diff --
    
    Ahh, this reorders the schema and breaks things. Props to @andyk.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55509530
  
    @JoshRosen I am hoping that #2164 will fix the test time outs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55076950
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55223584
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/59/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55512344
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/99/consoleFull) for   PR 2209 at commit [`729cca4`](https://github.com/apache/spark/commit/729cca4a038ab6dfbd538edc1eb24f78cdb52a7d).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T] `
      * `  class Encoder extends compression.Encoder[IntegerType.type] `
      * `  class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])`
      * `  class Encoder extends compression.Encoder[LongType.type] `
      * `  class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55792642
  
    Merged to master.  Thanks for looking this over!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-54699191
  
    **[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19848/consoleFull)** after     a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2209#issuecomment-55218135
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/59/consoleFull) for   PR 2209 at commit [`a703ff4`](https://github.com/apache/spark/commit/a703ff41743c566fee25c7121dbd077c6a52b021).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org