Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2015/02/26 22:47:09 UTC

[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/4795

    [SPARK-6024][SQL] When a data source table has too many columns, its schema cannot be stored in metastore.

    JIRA: https://issues.apache.org/jira/browse/SPARK-6024

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark wideSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4795.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4795
    
----
commit e9b4f7015c7c4b8da4e5a4e355428c07b8cb175b
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-26T21:27:21Z

    Failed test.

commit 12bacaeffb28f6766891540673a531da744d9f77
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-26T21:45:39Z

    If the JSON string of a schema is too large, split it before storing it in metastore.

----
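
For readers following the thread, here is a minimal sketch of the idea in the second commit: split the schema's JSON string into fixed-size parts on write and join them again on read. It assumes a plain `Map[String, String]` as a stand-in for the Hive table properties. `SchemaSplitSketch`, `toProperties`, and `fromProperties` are hypothetical names, not code from the patch. The property keys are the ones that appear in the diffs of this thread.

```scala
// Sketch only: a Map stands in for Hive table properties (an assumption,
// not the metastore API used by HiveMetastoreCatalog).
object SchemaSplitSketch {
  // Split the schema string into chunks of at most `threshold` characters
  // and record how many chunks were written. `threshold` must be positive.
  def toProperties(schemaJson: String, threshold: Int): Map[String, String] = {
    val parts = schemaJson.grouped(threshold).toSeq
    val numParts = "spark.sql.sources.schema.numParts" -> parts.size.toString
    val partProps = parts.zipWithIndex.map { case (part, index) =>
      s"spark.sql.sources.schema.part.$index" -> part
    }
    (numParts +: partProps).toMap
  }

  // Join the parts back into one string; None when no schema was stored.
  // This sketch assumes every part is present.
  def fromProperties(props: Map[String, String]): Option[String] =
    props.get("spark.sql.sources.schema.numParts").map { numParts =>
      (0 until numParts.toInt).map { index =>
        props(s"spark.sql.sources.schema.part.$index")
      }.mkString
    }
}
```

Joining with `mkString` and no separator is what makes the round trip lossless: the parts are raw character slices, not individually valid JSON.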


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76324955
  
    lgtm



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76296274
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28020/
    Test PASSed.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25484982
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -69,13 +69,22 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
             val table = synchronized {
               client.getTable(in.database, in.name)
             }
    -        val schemaString = table.getProperty("spark.sql.sources.schema")
             val userSpecifiedSchema =
    -          if (schemaString == null) {
    -            None
    -          } else {
    -            Some(DataType.fromJson(schemaString).asInstanceOf[StructType])
    -          }
    +          Option(table.getProperty("spark.sql.sources.schema.numParts")).flatMap { numParts =>
    --- End diff --
    
    you don't need a flatMap here do you? just a map and remove the Some at the end
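
A small sketch of why the two forms are interchangeable here: when the function passed to `flatMap` always wraps its result in `Some`, `map` gives the same answer without the wrapper. The object and method names are hypothetical, and `toInt` stands in for the real work done in the patch (reading the parts and calling `DataType.fromJson`).

```scala
object OptionMapSketch {
  // Form under review: flatMap whose body always returns Some(...).
  def viaFlatMap(numParts: Option[String]): Option[Int] =
    numParts.flatMap { n => Some(n.toInt) }

  // Suggested form: map, with the Some removed from the body.
  def viaMap(numParts: Option[String]): Option[Int] =
    numParts.map { n => n.toInt }
}
```

`flatMap` is only needed when the body itself can yield `None`.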



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76297590
  
      [Test build #28022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28022/consoleFull) for   PR 4795 at commit [`cc1d472`](https://github.com/apache/spark/commit/cc1d4724e2a8ab28f564265133638265497dd0e3).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76299559
  
      [Test build #28025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28025/consoleFull) for   PR 4795 at commit [`143927a`](https://github.com/apache/spark/commit/143927af48fdb67e1cb1fb22cdabc82df403f0db).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76324478
  
      [Test build #28043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28043/consoleFull) for   PR 4795 at commit [`4882e6f`](https://github.com/apache/spark/commit/4882e6f70a58120533397460e3023074490827a4).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25484986
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -69,13 +69,22 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
             val table = synchronized {
               client.getTable(in.database, in.name)
             }
    -        val schemaString = table.getProperty("spark.sql.sources.schema")
             val userSpecifiedSchema =
    -          if (schemaString == null) {
    -            None
    -          } else {
    -            Some(DataType.fromJson(schemaString).asInstanceOf[StructType])
    -          }
    +          Option(table.getProperty("spark.sql.sources.schema.numParts")).flatMap { numParts =>
    +            val parts = (0 until numParts.toInt).map { index =>
    +              val part = table.getProperty(s"spark.sql.sources.schema.part.${index}")
    +              if (part == null) {
    +                throw new AnalysisException(
    +                  "Could not read schema from the metastore because it is corrupted.")
    +              }
    +
    +              part
    +            }
    +            // Stick all parts back to a single schema string in the JSON representation
    +            // and convert it back to a StructType.
    +            Some(DataType.fromJson(parts.mkString).asInstanceOf[StructType])
    +        }
    --- End diff --
    
    this line is misindented



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76323886
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28031/
    Test PASSed.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76284907
  
      [Test build #28022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28022/consoleFull) for   PR 4795 at commit [`cc1d472`](https://github.com/apache/spark/commit/cc1d4724e2a8ab28f564265133638265497dd0e3).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76282138
  
      [Test build #28020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28020/consoleFull) for   PR 4795 at commit [`12bacae`](https://github.com/apache/spark/commit/12bacaeffb28f6766891540673a531da744d9f77).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25482589
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -119,7 +125,15 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
     
         tbl.setProperty("spark.sql.sources.provider", provider)
         if (userSpecifiedSchema.isDefined) {
    -      tbl.setProperty("spark.sql.sources.schema", userSpecifiedSchema.get.json)
    +      val threshold = hive.conf.schemaStringLengthThreshold
    +      val schemaJsonString = userSpecifiedSchema.get.json
    +      // Split the JSON string.
    +      val parts = schemaJsonString.grouped(threshold).toSeq
    +      tbl.setProperty("spark.sql.sources.schema.numOfParts", parts.size.toString)
    +      parts.zipWithIndex.foreach {
    +        case (part, index) =>
    --- End diff --
    
    move the case onto the previous line, i.e.
    
    ```scala
    parts.zipWithIndex.foreach { case (part, index) =>
    ```



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76296269
  
      [Test build #28020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28020/consoleFull) for   PR 4795 at commit [`12bacae`](https://github.com/apache/spark/commit/12bacaeffb28f6766891540673a531da744d9f77).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76331070
  
    Merging in!



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76297601
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28022/
    Test PASSed.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76317036
  
      [Test build #28031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28031/consoleFull) for   PR 4795 at commit [`73e71b4`](https://github.com/apache/spark/commit/73e71b4f12aa35cfd14f7c8e9d3748c516439c48).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76310282
  
      [Test build #28025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28025/consoleFull) for   PR 4795 at commit [`143927a`](https://github.com/apache/spark/commit/143927af48fdb67e1cb1fb22cdabc82df403f0db).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76323883
  
      [Test build #28031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28031/consoleFull) for   PR 4795 at commit [`73e71b4`](https://github.com/apache/spark/commit/73e71b4f12aa35cfd14f7c8e9d3748c516439c48).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25482564
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -69,13 +69,19 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
             val table = synchronized {
               client.getTable(in.database, in.name)
             }
    -        val schemaString = table.getProperty("spark.sql.sources.schema")
    +        val schemaString = Option(table.getProperty("spark.sql.sources.schema.numOfParts")) match {
    +          case Some(numOfParts) =>
    +            val parts = (0 until numOfParts.toInt).map { index =>
    +              Option(table.getProperty(s"spark.sql.sources.schema.part.${index}"))
    +                .getOrElse("Could not read schema from the metastore because it is corrupted.")
    --- End diff --
    
    should probably throw an exception here instead of returning a string which gets mkstring-ed.
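
To illustrate the concern, here is a hedged sketch contrasting the two behaviours. With `getOrElse(message)`, the error text is silently concatenated into the schema string. With `getOrElse(throw ...)`, the read fails at the missing part. `MissingPartSketch` and its methods are hypothetical, and `IllegalStateException` is used only to keep the sketch free of Spark dependencies (a later revision in this thread throws `AnalysisException`).

```scala
object MissingPartSketch {
  val message = "Could not read schema from the metastore because it is corrupted."

  // Behaviour under review: a missing part is replaced by the error text,
  // which then becomes part of the joined schema string.
  def joinWithFallback(parts: Seq[Option[String]]): String =
    parts.map(_.getOrElse(message)).mkString

  // Suggested behaviour: fail as soon as a part is missing.
  // IllegalStateException is a stand-in chosen for this sketch.
  def joinOrThrow(parts: Seq[Option[String]]): String =
    parts.map(_.getOrElse(throw new IllegalStateException(message))).mkString
}
```

The fallback version would only surface the problem later, as a confusing JSON parse error.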



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76330046
  
      [Test build #28043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28043/consoleFull) for   PR 4795 at commit [`4882e6f`](https://github.com/apache/spark/commit/4882e6f70a58120533397460e3023074490827a4).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25475275
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -119,7 +131,26 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
     
         tbl.setProperty("spark.sql.sources.provider", provider)
         if (userSpecifiedSchema.isDefined) {
    -      tbl.setProperty("spark.sql.sources.schema", userSpecifiedSchema.get.json)
    +      val threshold = hive.conf.schemaStringLengthThreshold
    +      val schemaJsonString = userSpecifiedSchema.get.json
    +      // Check if the size of the JSON string of the schema exceeds the threshold.
    +      if (schemaJsonString.size > threshold) {
    +        // Need to split the string.
    +        val parts = schemaJsonString.grouped(threshold).toSeq
    +        // First, record the total number of parts we have.
    +        tbl.setProperty("spark.sql.sources.schema.numOfParts", parts.size.toString)
    +        // Second, write every part to table property.
    +        parts.zipWithIndex.foreach {
    +          case (part, index) =>
    +            tbl.setProperty(s"spark.sql.sources.schema.part.${index}", part)
    +        }
    +      } else {
    +        // The length is less than the threshold, just put it in the table property.
    +        tbl.setProperty("spark.sql.sources.schema.numOfParts", "1")
    +        // We use spark.sql.sources.schema instead of using spark.sql.sources.schema.part.0
    +        // because users may have already created data source tables in metastore.
    +        tbl.setProperty("spark.sql.sources.schema", schemaJsonString)
    --- End diff --
    
    why don't we just always use schema.part.0? seems easier to consolidate the two code paths
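
A short sketch of why a single code path is enough: `grouped` already returns exactly one chunk when the string is no longer than the threshold, so the short case naturally lands in `schema.part.0` without a separate branch. `SinglePathSketch.split` is a hypothetical helper, not code from the patch.

```scala
object SinglePathSketch {
  // One path for both cases: short strings yield a single part,
  // long strings yield several. `threshold` must be positive.
  def split(schemaJson: String, threshold: Int): Seq[String] =
    schemaJson.grouped(threshold).toSeq
}
```

The trade-off raised in the diff comment still applies: tables written earlier under the single `spark.sql.sources.schema` key would not be covered by this path alone.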



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25482557
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -69,13 +69,19 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
             val table = synchronized {
               client.getTable(in.database, in.name)
             }
    -        val schemaString = table.getProperty("spark.sql.sources.schema")
    +        val schemaString = Option(table.getProperty("spark.sql.sources.schema.numOfParts")) match {
    --- End diff --
    
    I think it is more conventional to use numParts instead of numOfParts. Also you can remove the pattern matching by just applying a map.
     
    ```scala
    Option(table.getProperty("spark.sql.sources.schema.numParts")).map { numParts =>
      ...
    }
    ```



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76330062
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28043/
    Test PASSed.



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25484929
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -69,13 +69,22 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
             val table = synchronized {
               client.getTable(in.database, in.name)
             }
    -        val schemaString = table.getProperty("spark.sql.sources.schema")
             val userSpecifiedSchema =
    -          if (schemaString == null) {
    -            None
    -          } else {
    -            Some(DataType.fromJson(schemaString).asInstanceOf[StructType])
    -          }
    +          Option(table.getProperty("spark.sql.sources.schema.numParts")).flatMap { numParts =>
    +            val parts = (0 until numParts.toInt).map { index =>
    +              val part = table.getProperty(s"spark.sql.sources.schema.part.${index}")
    +              if (part == null) {
    +                throw new AnalysisException(
    +                  "Could not read schema from the metastore because it is corrupted.")
    --- End diff --
    
    sorry for being picky, but it would be great to include the reason why it is corrupted (i.e. "missing part x")
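
One way the request could look, as a sketch under assumptions: include the index of the missing part and the expected count in the message. The wording of the message, the `readPart` helper, and the use of `IllegalStateException` in place of `AnalysisException` are all hypothetical, and a `Map` again stands in for the table properties.

```scala
object CorruptionMessageSketch {
  // Look up one schema part, naming the missing part in the error.
  def readPart(props: Map[String, String], index: Int, numParts: Int): String = {
    val key = s"spark.sql.sources.schema.part.$index"
    props.getOrElse(key, throw new IllegalStateException(
      "Could not read schema from the metastore because it is corrupted " +
        s"(missing part $index of the schema, $numParts parts are expected)."))
  }
}
```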



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4795#discussion_r25475522
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -69,13 +69,25 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
             val table = synchronized {
               client.getTable(in.database, in.name)
             }
    -        val schemaString = table.getProperty("spark.sql.sources.schema")
    -        val userSpecifiedSchema =
    -          if (schemaString == null) {
    -            None
    -          } else {
    -            Some(DataType.fromJson(schemaString).asInstanceOf[StructType])
    +        val schemaString = Option(table.getProperty("spark.sql.sources.schema"))
    +          .orElse {
    +            // If spark.sql.sources.schema is not defined, we either splitted the schema to multiple
    +            // parts or the schema was not defined. To determine if the schema was defined,
    +            // we check spark.sql.sources.schema.numOfParts.
    +            Option(table.getProperty("spark.sql.sources.schema.numOfParts")) match {
    --- End diff --
    
    we can easily consolidate this path and remove the convoluted Option.orElse followed by pattern matching.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4795



[GitHub] spark pull request: [SPARK-6024][SQL] When a data source table has...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4795#issuecomment-76310289
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28025/
    Test PASSed.

