Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2015/02/12 11:36:53 UTC

[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/4563

    [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly introduced write support for data source API

    This PR migrates the Parquet data source to the new data source write support API. Now users can also overwrite and append to existing tables. Note that inserting into partitioned tables is not supported yet.
    
    When the Parquet data source is enabled, insertion into Hive Metastore Parquet tables is also fulfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.
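
    As a quick orientation, here is a minimal sketch of what the new write path enables, assuming the 1.3-era `DataFrame.save` API with an explicit `SaveMode` (the paths, `sqlContext`, and `df` are illustrative, not from this patch):

        import org.apache.spark.sql.sources.SaveMode

        // `sqlContext` is an existing SQLContext; read an existing Parquet table.
        val df = sqlContext.parquetFile("/data/events")

        // Overwrite an existing table.
        df.save("/data/events_copy", "org.apache.spark.sql.parquet", SaveMode.Overwrite)

        // Append to the same table; with this PR, SaveMode.Append no longer errors out.
        df.save("/data/events_copy", "org.apache.spark.sql.parquet", SaveMode.Append)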

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark parquet-refining

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4563
    
----
commit c1a4e78302dba5c0df8b353f3dbfe732bbb6e56e
Author: Cheng Lian <li...@databricks.com>
Date:   2015-02-11T01:28:37Z

    Passes Hive Metastore partitioning information to ParquetRelation2

commit 62490304f2a04a54a9677de1e89d550001fe57ab
Author: Cheng Lian <li...@databricks.com>
Date:   2015-02-11T22:04:28Z

    Rewires Parquet data source and the new data source write support

commit 38ebd4ebe37027dacd2adee7fc305d67cdd56633
Author: Cheng Lian <li...@databricks.com>
Date:   2015-02-12T00:26:19Z

    Temporary solution for moving Parquet conversion to analysis phase
    
    Although it works, it's so ugly... I duplicated the whole Analyzer
    in HiveContext. Have to fix this.

commit f7b81da52028ecaceefc01cfb5ce6d471271935c
Author: Cheng Lian <li...@databricks.com>
Date:   2015-02-12T10:03:45Z

    Cleaner solution for Metastore Parquet table conversion

commit ae17ea8a3b1123845cd0be19cd1073bbaf9d0d7b
Author: Cheng Lian <li...@databricks.com>
Date:   2015-02-12T10:21:29Z

    Fixes compilation errors introduced during rebasing

----




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572413
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -244,6 +244,7 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
       override protected[sql] lazy val analyzer =
         new Analyzer(catalog, functionRegistry, caseSensitive = false) {
           override val extendedRules =
    +        catalog.ParquetConversions ::
    --- End diff --
    
    Moved Metastore Parquet table conversion to the analysis phase.
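
    For context, extended rules run inside the analyzer, so the conversion happens before any planning strategy sees the plan. A minimal sketch of the shape of such a rule, assuming 1.3-era Catalyst APIs (the rule body is illustrative, not the actual `ParquetConversions` logic):

        import org.apache.spark.sql.catalyst.expressions.Literal
        import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
        import org.apache.spark.sql.catalyst.rules.Rule

        object ExampleRule extends Rule[LogicalPlan] {
          override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
            // Rewrite matching nodes; here we simply drop trivially-true filters.
            case Filter(Literal(true, _), child) => child
          }
        }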




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74476974
  
      [Test build #27544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27544/consoleFull) for   PR 4563 at commit [`fa98d27`](https://github.com/apache/spark/commit/fa98d277e5294af4b8a6460d5f02379c33da714d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ParquetRelation2(`





[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24620700
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -465,7 +498,7 @@ case class ParquetRelation2
         // before calling execute().
     
         val job = new Job(sqlContext.sparkContext.hadoopConfiguration)
    -    val writeSupport = if (schema.map(_.dataType).forall(_.isPrimitive)) {
    +    val writeSupport = if (parquetSchema.map(_.dataType).forall(_.isPrimitive)) {
    --- End diff --
    
    Right now, we have a schema check for `SaveMode.Append` in `CreateMetastoreDataSourceAsSelect`.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24630177
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -106,12 +106,12 @@ class DefaultSource
           ParquetRelation.createEmpty(
             path,
             data.schema.toAttributes,
    -        false,
    +        true,
    --- End diff --
    
    Yeah, good catch, we shouldn't overwrite the original metadata here, thanks!




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74413558
  
      [Test build #27517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27517/consoleFull) for   PR 4563 at commit [`2476e82`](https://github.com/apache/spark/commit/2476e82132512e2aa8b601619a7a94fe8967806c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ParquetRelation2(`





[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572271
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -304,13 +328,17 @@ case class ParquetRelation2
       @transient private val metadataCache = new MetadataCache
       metadataCache.refresh()
     
    -  private def partitionColumns = metadataCache.partitionSpec.partitionColumns
    +  def partitionSpec = metadataCache.partitionSpec
     
    -  private def partitions = metadataCache.partitionSpec.partitions
    +  def partitionColumns = metadataCache.partitionSpec.partitionColumns
     
    -  private def isPartitioned = partitionColumns.nonEmpty
    +  def partitions = metadataCache.partitionSpec.partitions
    --- End diff --
    
    These 3 were made public for testing purposes.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572328
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -413,18 +441,21 @@ case class ParquetRelation2
     
         // When the data does not include the key and the key is requested then we must fill it in
         // based on information from the input split.
    -    if (!dataSchemaIncludesPartitionKeys && partitionKeyLocations.nonEmpty) {
    +    if (!partitionKeysIncludedInDataSchema && partitionKeyLocations.nonEmpty) {
           baseRDD.mapPartitionsWithInputSplit { case (split: ParquetInputSplit, iterator) =>
             val partValues = selectedPartitions.collectFirst {
               case p if split.getPath.getParent.toString == p.path => p.values
             }.get
     
    +        val requiredPartOrdinal = partitionKeyLocations.keys.toSeq
    +
             iterator.map { pair =>
               val row = pair._2.asInstanceOf[SpecificMutableRow]
               var i = 0
    -          while (i < partValues.size) {
    +          while (i < requiredPartOrdinal.size) {
                 // TODO Avoids boxing cost here!
    -            row.update(partitionKeyLocations(i), partValues(i))
    +            val partOrdinal = requiredPartOrdinal(i)
    +            row.update(partitionKeyLocations(partOrdinal), partValues(partOrdinal))
    --- End diff --
    
    This is a bug fix: we shouldn't always append *all* partition keys to the result row, only the requested ones.
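
    A small standalone sketch of the scenario, with plain Scala collections standing in for `SpecificMutableRow` (the values and ordinals are made up): `partitionKeyLocations` only has entries for the *requested* partition columns, so the loop must iterate over its keys rather than over all partition values.

        // Partition column values parsed from the directory path, ordinals 0..2.
        val partValues = Seq("2015", "02", "12")
        // Only partition column 2 was requested; it lands at position 5 of the row.
        val partitionKeyLocations = Map(2 -> 5)

        val row = scala.collection.mutable.ArrayBuffer.fill[Any](6)(null)

        // Fixed logic: fill in only the requested partition columns. Looping over
        // all of partValues would look up partitionKeyLocations(0), which is absent.
        for (partOrdinal <- partitionKeyLocations.keys) {
          row(partitionKeyLocations(partOrdinal)) = partValues(partOrdinal)
        }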




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572199
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -151,9 +151,12 @@ private[parquet] case class PartitionSpec(partitionColumns: StructType, partitio
      * discovery.
      */
     @DeveloperApi
    -case class ParquetRelation2
    -    (paths: Seq[String], parameters: Map[String, String], maybeSchema: Option[StructType] = None)
    -    (@transient val sqlContext: SQLContext)
    +case class ParquetRelation2(
    +    paths: Seq[String],
    +    parameters: Map[String, String],
    +    maybeSchema: Option[StructType] = None,
    +    maybePartitionSpec: Option[PartitionSpec] = None)(
    --- End diff --
    
    Right now this is only used when converting Hive Metastore Parquet tables. When it is defined, we skip partition discovery below.
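
    A hedged sketch of the conversion call site implied here, using the constructors from this diff; `sqlContext` is assumed to be in scope, and the table layout is illustrative:

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.types.{StringType, StructField, StructType}

        // Partition spec reconstructed from Hive Metastore metadata instead of
        // being discovered from directory names.
        val spec = PartitionSpec(
          partitionColumns = StructType(StructField("ds", StringType) :: Nil),
          partitions = Seq(Partition(Row("2015-02-12"), "/warehouse/t/ds=2015-02-12")))

        val relation = ParquetRelation2(
          paths = Seq("/warehouse/t/ds=2015-02-12"),
          parameters = Map.empty,
          maybeSchema = None,
          maybePartitionSpec = Some(spec))(sqlContext)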




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24617694
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -187,31 +190,49 @@ case class ParquetRelation2
       @transient private val fs = FileSystem.get(sparkContext.hadoopConfiguration)
     
       private class MetadataCache {
    +    // `FileStatus` objects of all "_metadata" files.
         private var metadataStatuses: Array[FileStatus] = _
    +
    +    // `FileStatus` objects of all "_common_metadata" files.
         private var commonMetadataStatuses: Array[FileStatus] = _
    +
    +    // Parquet footer cache.
         private var footers: Map[FileStatus, Footer] = _
    -    private var parquetSchema: StructType = _
     
    +    // `FileStatus` objects of all data files (Parquet part-files).
         var dataStatuses: Array[FileStatus] = _
    +
    +    // Partition spec of this table, including names, data types, and values of each partition
    +    // column, and paths of each partition.
         var partitionSpec: PartitionSpec = _
    +
    +    // Schema of the actual Parquet files, without partition columns discovered from partition
    +    // directory paths.
    +    var parquetSchema: StructType = _
    +
    +    // Schema of the whole table, including partition columns.
         var schema: StructType = _
    -    var dataSchemaIncludesPartitionKeys: Boolean = _
     
    +    // Indicates whether partition columns are also included in Parquet data file schema.  If not,
    +    // we need to fill in partition column values into read rows when scanning the table.
    +    var partitionKeysIncludedInParquetSchema: Boolean = _
    +
    +    /**
    +     * Refreshes `FileStatus`es, footers, partition spec, and table schema.
    +     */
         def refresh(): Unit = {
    -      val baseStatuses = {
    -        val statuses = paths.distinct.map(p => fs.getFileStatus(fs.makeQualified(new Path(p))))
    -        // Support either reading a collection of raw Parquet part-files, or a collection of folders
    -        // containing Parquet files (e.g. partitioned Parquet table).
    -        assert(statuses.forall(!_.isDir) || statuses.forall(_.isDir))
    -        statuses.toArray
    -      }
    +      // Support either reading a collection of raw Parquet part-files, or a collection of folders
    +      // containing Parquet files (e.g. partitioned Parquet table).
    +      val baseStatuses = paths.distinct.map { p =>
    +        fs.getFileStatus(fs.makeQualified(new Path(p)))
    +      }.toArray
    +      assert(baseStatuses.forall(!_.isDir) || baseStatuses.forall(_.isDir))
     
           val leaves = baseStatuses.flatMap { f =>
             val statuses = SparkHadoopUtil.get.listLeafStatuses(fs, f.getPath).filter { f =>
               isSummaryFile(f.getPath) ||
                 !(f.getPath.getName.startsWith("_") || f.getPath.getName.startsWith("."))
             }
    --- End diff --
    
    Add a comment here to explain what we are doing.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24620616
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -465,7 +498,7 @@ case class ParquetRelation2
         // before calling execute().
     
         val job = new Job(sqlContext.sparkContext.hadoopConfiguration)
    -    val writeSupport = if (schema.map(_.dataType).forall(_.isPrimitive)) {
    +    val writeSupport = if (parquetSchema.map(_.dataType).forall(_.isPrimitive)) {
    --- End diff --
    
    What will happen if we do `save` with `SaveMode.Overwrite` and the given `DataFrame` has a different schema from the existing one?
    
    Should we use the schema of `data`? 




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74058651
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27345/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74052039
  
      [Test build #27346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27346/consoleFull) for   PR 4563 at commit [`efcc8d2`](https://github.com/apache/spark/commit/efcc8d2d0c22a0675e479068c41036256ba73d23).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74470473
  
    Many thanks to @yhuai, who helped get rid of a bug in those Parquet test suites that disable the Parquet data source and fall back to the old implementation!




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24630294
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -465,7 +498,7 @@ case class ParquetRelation2
         // before calling execute().
     
         val job = new Job(sqlContext.sparkContext.hadoopConfiguration)
    -    val writeSupport = if (schema.map(_.dataType).forall(_.isPrimitive)) {
    +    val writeSupport = if (parquetSchema.map(_.dataType).forall(_.isPrimitive)) {
    --- End diff --
    
    But I should use `data.schema` rather than `parquetSchema` here...




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74410721
  
      [Test build #27515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27515/consoleFull) for   PR 4563 at commit [`a83d290`](https://github.com/apache/spark/commit/a83d29064155ea48442ed4a2f378f00d52fa75f5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24619309
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -226,13 +247,14 @@ case class ParquetRelation2
             f -> new Footer(f.getPath, parquetMetadata)
           }.seq.toMap
     
    -      partitionSpec = {
    -        val partitionDirs = dataStatuses
    +      partitionSpec = maybePartitionSpec.getOrElse {
    +        val partitionDirs = leaves
    --- End diff --
    
    Is it possible that we will miss partitioning columns? If not, I think we can ignore entirely empty directories.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74470211
  
      [Test build #27544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27544/consoleFull) for   PR 4563 at commit [`fa98d27`](https://github.com/apache/spark/commit/fa98d277e5294af4b8a6460d5f02379c33da714d).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24630219
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -226,13 +247,14 @@ case class ParquetRelation2
             f -> new Footer(f.getPath, parquetMetadata)
           }.seq.toMap
     
    -      partitionSpec = {
    -        val partitionDirs = dataStatuses
    +      partitionSpec = maybePartitionSpec.getOrElse {
    +        val partitionDirs = leaves
    --- End diff --
    
    No, the number of partition columns must be the same across all partition directories.
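
    In other words, partition discovery assumes every partition directory encodes the same partition columns at the same depth, e.g.:

        /table/a=1/b=x/part-00000.parquet
        /table/a=2/b=y/part-00000.parquet

    A layout mixing depths (say `/table/a=1/...` next to `/table/a=2/b=y/...`) is invalid, so skipping an empty directory cannot silently drop a partition column that every other directory still carries.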




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572495
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -380,13 +400,48 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
         }
       }
     
    +  object ParquetConversions extends Rule[LogicalPlan] {
    --- End diff --
    
    Forgot to write ScalaDoc for this rule.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24571983
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -89,7 +89,7 @@ class DefaultSource
         val doSave = if (fs.exists(filesystemPath)) {
           mode match {
             case SaveMode.Append =>
    -          sys.error(s"Append mode is not supported by ${this.getClass.getCanonicalName}")
    +          true
    --- End diff --
    
    Enabling append.
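
    A hedged sketch of the save-mode dispatch around this change (the Overwrite, ErrorIfExists, and Ignore branches are inferred from the data source API contract, not shown in this diff):

        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.spark.sql.sources.SaveMode

        def shouldSave(fs: FileSystem, filesystemPath: Path, mode: SaveMode): Boolean =
          if (fs.exists(filesystemPath)) {
            mode match {
              case SaveMode.Append        => true   // keep existing files, write more
              case SaveMode.Overwrite     => fs.delete(filesystemPath, true); true
              case SaveMode.ErrorIfExists =>
                sys.error(s"path $filesystemPath already exists.")
              case SaveMode.Ignore        => false  // silently skip the write
            }
          } else {
            true
          }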




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74476986
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27544/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74410800
  
      [Test build #27515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27515/consoleFull) for   PR 4563 at commit [`a83d290`](https://github.com/apache/spark/commit/a83d29064155ea48442ed4a2f378f00d52fa75f5).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ParquetRelation2(`





[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74410802
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27515/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24721072
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTest.scala ---
    @@ -23,8 +23,9 @@ import scala.reflect.ClassTag
     import scala.reflect.runtime.universe.TypeTag
     import scala.util.Try
     
    -import org.apache.spark.sql.{DataFrame, SQLContext}
     import org.apache.spark.sql.catalyst.util
    +import org.apache.spark.sql.sources.SaveMode
    --- End diff --
    
    Yeah, rebased.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24616222
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -106,12 +106,12 @@ class DefaultSource
           ParquetRelation.createEmpty(
             path,
             data.schema.toAttributes,
    -        false,
    +        true,
    --- End diff --
    
    Use a named parameter for `allowExisting`.
    
    Also, it seems we will overwrite the metadata file. In the case of Append, do we still want to overwrite the metadata?
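
    A tiny self-contained illustration of the suggestion (the helper is hypothetical, standing in for `ParquetRelation.createEmpty`): a bare boolean hides its meaning at the call site, while a named argument documents it.

        def createEmpty(path: String, allowExisting: Boolean): Unit =
          println(s"createEmpty($path, allowExisting = $allowExisting)")

        createEmpty("/tmp/t", true)                  // what does `true` mean here?
        createEmpty("/tmp/t", allowExisting = true)  // self-documenting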




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4563




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74481284
  
    @yhuai @marmbrus Thanks for the review! I'm gonna merge this as it contains a bunch of fixes and enables the full power of the new Parquet data source :)




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74411260
  
      [Test build #27517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27517/consoleFull) for   PR 4563 at commit [`2476e82`](https://github.com/apache/spark/commit/2476e82132512e2aa8b601619a7a94fe8967806c).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74427223
  
    @yhuai @marmbrus All comments have been addressed; I had to squash all commits to ease rebasing. Should be good to go.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24608107
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -106,12 +106,12 @@ class DefaultSource
           ParquetRelation.createEmpty(
    --- End diff --
    
    Currently we are still using some utility functions like this one from the old Parquet support code. We can move them into the new data source in the future.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24715838
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -244,6 +244,7 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
       override protected[sql] lazy val analyzer =
         new Analyzer(catalog, functionRegistry, caseSensitive = false) {
           override val extendedRules =
    +        catalog.ParquetConversions ::
    --- End diff --
    
    So much cleaner....




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24621161
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -381,12 +391,51 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
       }
     
       /**
    +   * When scanning or writing to non-partitioned Metastore Parquet tables, convert them to Parquet
    +   * data source relations for better performance.
    +   */
    +  object ParquetConversions extends Rule[LogicalPlan] {
    +    override def apply(plan: LogicalPlan): LogicalPlan = {
    +      // Collects all `MetastoreRelation`s which should be replaced
    +      val toBeReplaced = plan.collect {
    +        case InsertIntoTable(relation: MetastoreRelation, _, _, _)
    +            // Inserting into partitioned table is not supported in Parquet data source (yet).
    +            if !relation.hiveQlTable.isPartitioned &&
    +              hive.convertMetastoreParquet &&
    +              hive.conf.parquetUseDataSourceApi &&
    +              relation.tableDesc.getSerdeClassName.toLowerCase.contains("parquet") =>
    +          relation
    +
    +        case p @ PhysicalOperation(_, _, relation: MetastoreRelation)
    --- End diff --
    
    Add a comment to explain when this one will be triggered?




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24608728
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -226,13 +247,14 @@ case class ParquetRelation2
             f -> new Footer(f.getPath, parquetMetadata)
           }.seq.toMap
     
    -      partitionSpec = {
    -        val partitionDirs = dataStatuses
    +      partitionSpec = maybePartitionSpec.getOrElse {
    +        val partitionDirs = leaves
    --- End diff --
    
    Using all files rather than only Parquet part-files, because there can be empty partitions. (Should we include entirely empty directories here, namely directories that don't even contain `_metadata` or `_common_metadata`?)
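
    To illustrate why discovery should look at all leaf files rather than data files only, consider an empty partition that contains nothing but summary files:

        /table/ds=2015-02-11/part-00000.parquet    <- data file
        /table/ds=2015-02-12/_metadata             <- empty partition, summary only

    Discovering partitions from data files alone would silently drop `ds=2015-02-12`; a directory with no files at all is the open question raised above.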




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74063265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27346/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74050135
  
    cc @marmbrus @rxin @yhuai




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24630271
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -465,7 +498,7 @@ case class ParquetRelation2
         // before calling execute().
     
         val job = new Job(sqlContext.sparkContext.hadoopConfiguration)
    -    val writeSupport = if (schema.map(_.dataType).forall(_.isPrimitive)) {
    +    val writeSupport = if (parquetSchema.map(_.dataType).forall(_.isPrimitive)) {
    --- End diff --
    
    For `SaveMode.Overwrite`, we are always writing to an empty directory, so there's no "existing one".




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572471
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -183,21 +177,47 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
           if (hive.convertMetastoreParquet &&
               hive.conf.parquetUseDataSourceApi &&
               relation.tableDesc.getSerdeClassName.toLowerCase.contains("parquet")) {
    -        val metastoreSchema = StructType.fromAttributes(relation.output)
    -        val paths = if (relation.hiveQlTable.isPartitioned) {
    -          relation.hiveQlPartitions.map(p => p.getLocation)
    -        } else {
    -          Seq(relation.hiveQlTable.getDataLocation.toString)
    -        }
    -
    -        LogicalRelation(ParquetRelation2(
    -          paths, Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json))(hive))
    +        // convertToParquetRelation(relation)
    +        relation
    --- End diff --
    
    Oops, should remove the whole `if` here.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74410672
  
    Squashed all commits to ease rebasing.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74063259
  
      [Test build #27346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27346/consoleFull) for   PR 4563 at commit [`efcc8d2`](https://github.com/apache/spark/commit/efcc8d2d0c22a0675e479068c41036256ba73d23).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ParquetRelation2(`





[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572131
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -123,9 +123,9 @@ class DefaultSource
       }
     }
     
    -private[parquet] case class Partition(values: Row, path: String)
    +private[sql] case class Partition(values: Row, path: String)
     
    -private[parquet] case class PartitionSpec(partitionColumns: StructType, partitions: Seq[Partition])
    +private[sql] case class PartitionSpec(partitionColumns: StructType, partitions: Seq[Partition])
    --- End diff --
    
    These two are now private to the `sql` package because I need them to pass Metastore partitioning information into `ParquetRelation2` while converting Metastore Parquet tables, and that part of the code resides in the `sql.hive` package.
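
    For readers less familiar with Scala's qualified access modifiers, a tiny self-contained illustration (package names hypothetical): `private[sql]` grants access to everything under the `sql` package, including sub-packages such as `sql.hive`.

        package org.example.sql {
          private[sql] case class Spec(name: String)
        }

        package org.example.sql.hive {
          object Uses {
            // Compiles: org.example.sql.hive is inside org.example.sql.
            val s = org.example.sql.Spec("ok")
          }
        }

        // Outside org.example.sql, referencing Spec would fail to compile.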




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74413560
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27517/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74058642
  
      [Test build #27345 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27345/consoleFull) for   PR 4563 at commit [`ae17ea8`](https://github.com/apache/spark/commit/ae17ea8a3b1123845cd0be19cd1073bbaf9d0d7b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ParquetRelation2(`





[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24617792
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -187,31 +190,49 @@ case class ParquetRelation2
       @transient private val fs = FileSystem.get(sparkContext.hadoopConfiguration)
     
       private class MetadataCache {
    +    // `FileStatus` objects of all "_metadata" files.
         private var metadataStatuses: Array[FileStatus] = _
    +
    +    // `FileStatus` objects of all "_common_metadata" files.
         private var commonMetadataStatuses: Array[FileStatus] = _
    +
    +    // Parquet footer cache.
         private var footers: Map[FileStatus, Footer] = _
    -    private var parquetSchema: StructType = _
     
    +    // `FileStatus` objects of all data files (Parquet part-files).
         var dataStatuses: Array[FileStatus] = _
    +
    +    // Partition spec of this table, including names, data types, and values of each partition
    +    // column, and paths of each partition.
         var partitionSpec: PartitionSpec = _
    +
    +    // Schema of the actual Parquet files, without partition columns discovered from partition
    +    // directory paths.
    +    var parquetSchema: StructType = _
    +
    +    // Schema of the whole table, including partition columns.
         var schema: StructType = _
    -    var dataSchemaIncludesPartitionKeys: Boolean = _
     
    +    // Indicates whether partition columns are also included in Parquet data file schema.  If not,
    +    // we need to fill in partition column values into read rows when scanning the table.
    +    var partitionKeysIncludedInParquetSchema: Boolean = _
    +
    +    /**
    +     * Refreshes `FileStatus`es, footers, partition spec, and table schema.
    +     */
         def refresh(): Unit = {
    -      val baseStatuses = {
    -        val statuses = paths.distinct.map(p => fs.getFileStatus(fs.makeQualified(new Path(p))))
    -        // Support either reading a collection of raw Parquet part-files, or a collection of folders
    -        // containing Parquet files (e.g. partitioned Parquet table).
    -        assert(statuses.forall(!_.isDir) || statuses.forall(_.isDir))
    -        statuses.toArray
    -      }
    +      // Support either reading a collection of raw Parquet part-files, or a collection of folders
    +      // containing Parquet files (e.g. partitioned Parquet table).
    +      val baseStatuses = paths.distinct.map { p =>
    +        fs.getFileStatus(fs.makeQualified(new Path(p)))
    +      }.toArray
    +      assert(baseStatuses.forall(!_.isDir) || baseStatuses.forall(_.isDir))
     
           val leaves = baseStatuses.flatMap { f =>
    --- End diff --
    
    Maybe `leafSummaryFiles`?



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74050125
  
      [Test build #27345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27345/consoleFull) for PR 4563 at commit [`ae17ea8`](https://github.com/apache/spark/commit/ae17ea8a3b1123845cd0be19cd1073bbaf9d0d7b).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24572003
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -106,12 +106,12 @@ class DefaultSource
           ParquetRelation.createEmpty(
             path,
             data.schema.toAttributes,
    -        false,
    +        true,
    --- End diff --
    
    Don't throw if the destination path already exists.
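
    If that boolean is indeed an `allowExisting` flag (my reading of the diff, not verified here), a named argument would make the call self-documenting:
    
    ```scala
    ParquetRelation.createEmpty(
      path,
      data.schema.toAttributes,
      allowExisting = true,
      sqlContext.sparkContext.hadoopConfiguration,
      sqlContext)
    ```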



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24735725
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -139,15 +139,19 @@ private[hive] trait HiveStrategies {
     
                 val partitionLocations = partitions.map(_.getLocation)
     
    -            hiveContext
    -              .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
    -              .addPartitioningAttributes(relation.partitionKeys)
    -              .lowerCase
    -              .where(unresolvedOtherPredicates)
    -              .select(unresolvedProjection: _*)
    -              .queryExecution
    -              .executedPlan
    -              .fakeOutput(projectList.map(_.toAttribute)) :: Nil
    +            if (partitionLocations.isEmpty) {
    +              PhysicalRDD(plan.output, sparkContext.emptyRDD[Row]) :: Nil
    +            } else {
    +              hiveContext
    +                .parquetFile(partitionLocations.head, partitionLocations.tail: _*)
    --- End diff --
    
    This is a bug fix. When no partition is selected, `partitionLocations.head` throws. In Spark 1.2, `parquetFile` accepted a single `path` argument; with no partitions, `path` is empty and `parquetFile` throws an `IllegalArgumentException`, which is explicitly caught below.
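
    For reference, `.head` on an empty collection fails fast, which is why the empty-partition guard above is needed:
    
    ```scala
    Seq.empty[String].head  // throws java.util.NoSuchElementException
    ```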



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24616376
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -106,12 +106,12 @@ class DefaultSource
           ParquetRelation.createEmpty(
             path,
             data.schema.toAttributes,
    -        false,
    +        true,
             sqlContext.sparkContext.hadoopConfiguration,
             sqlContext)
     
           val createdRelation = createRelation(sqlContext, parameters, data.schema)
    -      createdRelation.asInstanceOf[ParquetRelation2].insert(data, true)
    +      createdRelation.asInstanceOf[ParquetRelation2].insert(data, mode == SaveMode.Overwrite)
    --- End diff --
    
    Use a named parameter for `overwrite`.
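
    For example (assuming the parameter is in fact named `overwrite`):
    
    ```scala
    createdRelation.asInstanceOf[ParquetRelation2].insert(data, overwrite = mode == SaveMode.Overwrite)
    ```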



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4563#issuecomment-74471070
  
    Removed the `withSQLConf` trick in test suites like `ParquetQuerySuite`. A simplified version of the problem looks like this:
    
    ```scala
    withSQLConf(SQLConf.PARQUET_USE_DATA_SOURCE_API -> "false") {
      test("some test") {
         ...
      }
    }
    ```
    
    The execution order of this snippet is:
    
    1. `SQLConf.PARQUET_USE_DATA_SOURCE_API` is set to `false`
    2. The ScalaTest test case "some test" is registered
    3. `SQLConf.PARQUET_USE_DATA_SOURCE_API` is reverted (removed)
    4. ScalaTest starts executing test case "some test"
    5. "some test" runs with the default value of `SQLConf.PARQUET_USE_DATA_SOURCE_API`, which is `true`
    
    In the last commit, I removed the `withSQLConf` trick and fell back to `beforeAll`/`afterAll`. This introduced hundreds of lines of indentation changes in several test suites, making this PR twice as large.
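
    A minimal sketch of that `beforeAll`/`afterAll` pattern (assuming a `sqlContext` in scope; the suite name is illustrative, not the actual test code):
    
    ```scala
    import org.scalatest.{BeforeAndAfterAll, FunSuite}
    
    class SomeParquetSuite extends FunSuite with BeforeAndAfterAll {
      private var originalSetting: String = _
    
      override protected def beforeAll(): Unit = {
        // Flip the flag before ScalaTest starts running any test in this suite.
        originalSetting = sqlContext.getConf(SQLConf.PARQUET_USE_DATA_SOURCE_API, "true")
        sqlContext.setConf(SQLConf.PARQUET_USE_DATA_SOURCE_API, "false")
      }
    
      override protected def afterAll(): Unit = {
        // Restore the flag once the whole suite has finished.
        sqlContext.setConf(SQLConf.PARQUET_USE_DATA_SOURCE_API, originalSetting)
      }
    
      test("some test") {
        // The body now actually sees PARQUET_USE_DATA_SOURCE_API = false.
      }
    }
    ```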



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24715821
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTest.scala ---
    @@ -23,8 +23,9 @@ import scala.reflect.ClassTag
     import scala.reflect.runtime.universe.TypeTag
     import scala.util.Try
     
    -import org.apache.spark.sql.{DataFrame, SQLContext}
     import org.apache.spark.sql.catalyst.util
    +import org.apache.spark.sql.sources.SaveMode
    --- End diff --
    
    Is this up to date? I thought we moved `SaveMode` to `sql`.



[GitHub] spark pull request: [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4563#discussion_r24612662
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -545,13 +578,16 @@ object ParquetRelation2 {
       // Whether we should merge schemas collected from all Parquet part-files.
       val MERGE_SCHEMA = "mergeSchema"
     
    -  // Hive Metastore schema, passed in when the Parquet relation is converted from Metastore
    +  // Hive Metastore schema, passed in when the Parquet relation is converted from Metastore.
       val METASTORE_SCHEMA = "metastoreSchema"
     
    -  // Default partition name to use when the partition column value is null or empty string
    +  // Schema of partition keys for partitioned Hive Metastore Parquet tables.
    +  val METASTORE_PARTITION_KEYS_SCHEMA = "metastorePartitionKeysSchema"
    --- End diff --
    
    Ah, forgot to remove this. It's not used any more since we construct `PartitionSpec` manually and pass it to `ParquetRelation2` when converting Metastore Parquet tables.

