Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2015/10/06 00:56:18 UTC

[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/8988

    [SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format

    This PR refactors the Parquet write path to follow the parquet-format spec.  It's a successor of PR #7679, but with fewer non-essential changes.
    
    Major changes include:
    
    1.  Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`
    
        - Eliminates per-value data type dispatch costs via prebuilt, composed writer functions
        - Supports legacy mode which is compatible with Parquet format used by Spark 1.5 and prior versions
    
          Legacy mode is now on by default, and can be turned off by setting the SQL option `spark.sql.parquet.writeLegacyFormat` to `false`.
    
    2.  Cleans up the last pieces of old Parquet support code
    
    As pointed out by @rxin, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity.  But I'd like to do this in a follow-up PR to minimize code review noise in this one.
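The "prebuilt composed writer functions" idea from change 1 can be sketched in plain Scala. This is a toy model with made-up names, not the actual `CatalystWriteSupport` code: the point is that type dispatch happens once per schema, and the per-row hot path only invokes prebuilt closures.

```scala
// Toy model of prebuilt writers: dispatch on the data type once, when the
// schema is converted, then reuse the resulting closures for every row.
sealed trait ToyType
case object ToyInt extends ToyType
case object ToyString extends ToyType

object WriterSketch {
  // Stand-in for the real Parquet record consumer: render a value as text.
  type ValueWriter = Any => String

  // Pattern match on the type here, outside the per-row hot path.
  def makeWriter(t: ToyType): ValueWriter = t match {
    case ToyInt    => v => s"int:${v.asInstanceOf[Int]}"
    case ToyString => v => s"str:${v.asInstanceOf[String]}"
  }

  // Writing a row is now just invoking the prebuilt closures in order.
  def writeRow(writers: Seq[ValueWriter], row: Seq[Any]): Seq[String] =
    writers.zip(row).map { case (w, v) => w(v) }
}
```

Calling `writeRow` with writers built from `Seq(ToyInt, ToyString)` and the row `Seq(1, "a")` produces `Seq("int:1", "str:a")` without any per-value type checks.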

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-8848/standard-parquet-write-path

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8988.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8988
    
----
commit d1583f88507a32afb509d33313d0cbe02de4897a
Author: Cheng Lian <li...@databricks.com>
Date:   2015-10-05T22:42:44Z

    Refactors Parquet write path to follow parquet-format

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146385921
  
    Merged build started.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146379693
  
     Merged build triggered.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146393689
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43365/
    Test FAILed.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146393217
  
      [Test build #43360 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43360/console) for   PR 8988 at commit [`f03ef93`](https://github.com/apache/spark/commit/f03ef93e06c1241c69b49dd89d0b155b7ef87019).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146682933
  
    Merged build started.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145713434
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43258/
    Test PASSed.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145999926
  
      [Test build #43291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43291/console) for   PR 8988 at commit [`a680f09`](https://github.com/apache/spark/commit/a680f09320aabd65f64fe071ed6b697862500eb9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145960253
  
    retest this please



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146457243
  
    Merged build finished. Test PASSed.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146708650
  
      [Test build #43427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43427/console) for   PR 8988 at commit [`fb6ee9f`](https://github.com/apache/spark/commit/fb6ee9fc2d39688dcbdc55398013324ede708848).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145692361
  
    Merged build started.



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41301573
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala ---
    @@ -95,7 +95,9 @@ private[parquet] class CatalystReadSupport extends ReadSupport[InternalRow] with
            """.stripMargin
         }
     
    -    new CatalystRecordMaterializer(parquetRequestedSchema, catalystRequestedSchema)
    +    new CatalystRecordMaterializer(
    +      parquetRequestedSchema,
    +      CatalystReadSupport.expandUDT(catalystRequestedSchema))
    --- End diff --
    
    Expands UDTs early so that `CatalystRowConverter` always receives a Catalyst schema without UDTs.
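A rough sketch of what expanding UDTs up front could look like (simplified; not necessarily the exact `CatalystReadSupport.expandUDT` implementation):

```scala
import org.apache.spark.sql.types._

// Recursively replace every UDT with its underlying Catalyst SQL type, so
// that converters downstream only ever see plain Catalyst types.
def expandUDT(dataType: DataType): DataType = dataType match {
  case udt: UserDefinedType[_] =>
    expandUDT(udt.sqlType)
  case ArrayType(elementType, containsNull) =>
    ArrayType(expandUDT(elementType), containsNull)
  case MapType(keyType, valueType, valueContainsNull) =>
    MapType(expandUDT(keyType), expandUDT(valueType), valueContainsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = expandUDT(f.dataType))))
  case other => other
}
```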



Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41544041
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala ---
    @@ -347,13 +350,13 @@ private[parquet] class CatalystSchemaConverter(
           // NOTE: Spark SQL TimestampType is NOT a well defined type in Parquet format spec.
           //
           // As stated in PARQUET-323, Parquet `INT96` was originally introduced to represent nanosecond
    -      // timestamp in Impala for some historical reasons, it's not recommended to be used for any
    -      // other types and will probably be deprecated in future Parquet format spec.  That's the
    -      // reason why Parquet format spec only defines `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS` which
    -      // are both logical types annotating `INT64`.
    +      // timestamp in Impala for some historical reasons.  It's not recommended to be used for any
    +      // other types and will probably be deprecated in some future version of parquet-format spec.
    +      // That's the reason why parquet-format spec only defines `TIMESTAMP_MILLIS` and
    +      // `TIMESTAMP_MICROS` which are both logical types annotating `INT64`.
           //
           // Originally, Spark SQL uses the same nanosecond timestamp type as Impala and Hive.  Starting
    -      // from Spark 1.5.0, we resort to a timestamp type with 100 ns precision so that we can store
    +      // from Spark 1.4.0, we resort to a timestamp type with 100 ns precision so that we can store
    --- End diff --
    
    This should be 1.5



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146700001
  
    @davies Although Hive doesn't write using the standard Parquet format, it can read standard LIST and MAP; it just doesn't recognize compact decimals. So even if we turn off legacy mode, we can still interoperate with Hive as long as no compact decimals are written (either by disabling them explicitly using an extra SQL option, or by writing decimals with precision larger than 18). The benefit of adding an extra option is that we can still let Spark write standard Parquet files by default.
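As a sketch of the interop scenario described above (assuming an existing `SQLContext` named `sqlContext` and a `DataFrame` named `df`; the decimal-specific option is left out because it isn't named here):

```scala
// Write standard (non-legacy) Parquet; per the discussion above, Hive can
// still read the standard LIST/MAP layout as long as no compact decimals
// are written, e.g. when all decimals have precision larger than 18.
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "false")
df.write.parquet("/tmp/standard-parquet-output")  // hypothetical path
```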



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41551421
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
    @@ -176,6 +178,28 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
         }
       }
     
    +  test("read standard Parquet file under legacy mode") {
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> "false") {
    +        sqlContext
    +          .range(4)
    +          .selectExpr("NAMED_STRUCT('a', id, 'b', ARRAY(id, id + 1, id + 2)) AS s")
    +          .write
    +          .parquet(path)
    +      }
    +
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> "true") {
    --- End diff --
    
    Yeah, I agree that this test is redundant. I'm removing it.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146370096
  
    Merged build finished. Test FAILed.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146675252
  
    Merged build finished. Test FAILed.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145995412
  
      [Test build #43288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43288/console) for   PR 8988 at commit [`d395bfd`](https://github.com/apache/spark/commit/d395bfd0bb73a7522e807e43f80e4307a65001a2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146403816
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43372/
    Test FAILed.



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146681939
  
    retest this please



Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146691882
  
    Since we already have an option for being compatible with Hive (the legacy mode), we should not worry about that (no need to change anything in this PR).

    Hive 1.2 and Spark 1.4 will exist for a long time. If we plan to be compatible with them out of the box (without any configuration), then we can't move forward.

    Parquet format 2 will have the same compatibility issue.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146682895
  
     Merged build triggered.



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146371127
  
    @davies Thanks for the review. Turned legacy mode off by default, and made it a public option. Other offline comments are also addressed.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146651234
  
     Merged build triggered.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145745724
  
      [Test build #43272 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43272/consoleFull) for   PR 8988 at commit [`6fd20f7`](https://github.com/apache/spark/commit/6fd20f70baa535b1772c1c30a3f651ea673560f2).



Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41547271
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
    @@ -99,16 +99,18 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
         withSQLConf(SQLConf.PARQUET_BINARY_AS_STRING.key -> "true")(checkParquetFile(data))
       }
     
    -  test("fixed-length decimals") {
    -    def makeDecimalRDD(decimal: DecimalType): DataFrame =
    -      sparkContext
    -        .parallelize(0 to 1000)
    -        .map(i => Tuple1(i / 100.0))
    -        .toDF()
    -        // Parquet doesn't allow column names with spaces, have to add an alias here
    -        .select($"_1" cast decimal as "dec")
    +  testStandardAndLegacyModes("fixed-length decimals") {
    +    def makeDecimalRDD(decimal: DecimalType): DataFrame = {
    +      sqlContext
    +        .range(1000)
    +        // Parquet doesn't allow column names with spaces, have to add an alias here.
    +        // Minus 500 here so that negative decimals are also tested.
    +        .select((('id - 500) / 100.0) cast decimal as 'dec)
    --- End diff --
    
    The default decimal type will be (10, 0); should we use a larger scale? Otherwise the numbers will be rounded.
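The rounding concern can be illustrated with plain `java.math.BigDecimal`, independent of Spark:

```scala
import java.math.{BigDecimal => JBigDecimal}
import java.math.RoundingMode

// A decimal(10, 0) keeps no fractional digits, so a value such as 4.95
// rounds to a whole number ...
val rounded = new JBigDecimal("4.95").setScale(0, RoundingMode.HALF_UP)
// ... while a scale of 2 preserves it exactly.
val exact = new JBigDecimal("4.95").setScale(2, RoundingMode.HALF_UP)
```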



Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146367712
  
    @liancheng As we discussed offline, we should turn the legacy mode off by default; the legacy mode is the one compatible with 1.4 and prior versions.



Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41465749
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -282,33 +277,4 @@ private[sql] object ParquetFilters {
         addMethod.setAccessible(true)
         addMethod.invoke(null, classOf[Binary], enumTypeDescriptor)
       }
    -
    -  /**
    -   * Note: Inside the Hadoop API we only have access to `Configuration`, not to
    -   * [[org.apache.spark.SparkContext]], so we cannot use broadcasts to convey
    -   * the actual filter predicate.
    -   */
    -  def serializeFilterExpressions(filters: Seq[Expression], conf: Configuration): Unit = {
    -    if (filters.nonEmpty) {
    -      val serialized: Array[Byte] =
    -        SparkEnv.get.closureSerializer.newInstance().serialize(filters).array()
    -      val encoded: String = BaseEncoding.base64().encode(serialized)
    -      conf.set(PARQUET_FILTER_DATA, encoded)
    -    }
    -  }
    -
    -  /**
    -   * Note: Inside the Hadoop API we only have access to `Configuration`, not to
    -   * [[org.apache.spark.SparkContext]], so we cannot use broadcasts to convey
    -   * the actual filter predicate.
    -   */
    -  def deserializeFilterExpressions(conf: Configuration): Seq[Expression] = {
    -    val data = conf.get(PARQUET_FILTER_DATA)
    -    if (data != null) {
    -      val decoded: Array[Byte] = BaseEncoding.base64().decode(data)
    -      SparkEnv.get.closureSerializer.newInstance().deserialize(ByteBuffer.wrap(decoded))
    -    } else {
    -      Seq()
    -    }
    -  }
    --- End diff --
    
    This part of the code is unrelated to this PR, but it has been dead for a while, so I'm removing it here.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146393293
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146431530
  
    Fixed the failed Hive test cases by enabling legacy mode explicitly.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145951488
  
    Fixed a bug related to UDTs: an exception was thrown when reading Parquet files containing UDT values under standard mode. Regression tests are added in [`UserDefinedTypeSuite`][1].
    
    In 1.5 and earlier versions, when reading Parquet files containing UDT values, we pass a schema containing UDTs to `CatalystRowConverter` and translate the UDTs into corresponding Parquet types there to create value converters. The problem is that `CatalystRowConverter` isn't aware of the standard/legacy mode, and always generates Parquet types in the legacy format. The last commit fixes this issue by expanding UDTs early in `CatalystRowMaterializer`, so that `CatalystRowConverter` no longer needs to worry about UDTs.
    
    [1]: https://github.com/liancheng/spark/commit/d395bfd0bb73a7522e807e43f80e4307a65001a2#diff-5404bf307e20b3603a661e29a8a9f8fc
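
    A minimal, self-contained sketch of the expansion idea, using a toy type tree instead of Spark's real `DataType` hierarchy (`MiniType` and its cases are invented for illustration):

    ```scala
    // Toy model of the recursive UDT expansion: replace every UDT node with
    // the SQL type it is stored as, so downstream converters never see UDTs.
    sealed trait MiniType
    case object IntT extends MiniType
    case class ArrayT(element: MiniType) extends MiniType
    case class MapT(key: MiniType, value: MiniType) extends MiniType
    case class StructT(fields: Seq[(String, MiniType)]) extends MiniType
    // A "UDT" here just wraps its underlying SQL type.
    case class UdtT(sqlType: MiniType) extends MiniType

    object ExpandUdt {
      def expand(t: MiniType): MiniType = t match {
        case ArrayT(e)   => ArrayT(expand(e))
        case MapT(k, v)  => MapT(expand(k), expand(v))
        case StructT(fs) => StructT(fs.map { case (n, ft) => (n, expand(ft)) })
        case UdtT(sql)   => expand(sql) // also handles a UDT nested in a UDT
        case other       => other
      }
    }
    ```

    Note that this sketch also recurses into the UDT's `sqlType`, whereas the `expandUDT` in the diff discussed later in this thread returns `t.sqlType` directly.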




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145745361
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145956501
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145960229
  
    The last Jenkins build failure was caused by an artifact download failure.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146391278
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43361/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146386880
  
      [Test build #43372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43372/consoleFull) for   PR 8988 at commit [`af50f9c`](https://github.com/apache/spark/commit/af50f9c96f11f2413d0ba59cba3dc104f2a159be).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146393660
  
      [Test build #43365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43365/console) for   PR 8988 at commit [`c542ae9`](https://github.com/apache/spark/commit/c542ae93eef917f310a3331a0bdf90cb6260db4f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146651550
  
      [Test build #43417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43417/consoleFull) for   PR 8988 at commit [`fb6ee9f`](https://github.com/apache/spark/commit/fb6ee9fc2d39688dcbdc55398013324ede708848).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146457244
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43387/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146391277
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145713344
  
      [Test build #43258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43258/console) for   PR 8988 at commit [`d1583f8`](https://github.com/apache/spark/commit/d1583f88507a32afb509d33313d0cbe02de4897a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146385909
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146714344
  
    Merged to master, thanks @davies for the detailed review!
    
    Finally fixed all the Parquet compatibility issues after 6 months!




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146403789
  
      [Test build #43372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43372/console) for   PR 8988 at commit [`af50f9c`](https://github.com/apache/spark/commit/af50f9c96f11f2413d0ba59cba3dc104f2a159be).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146650973
  
    @davies All comments addressed. Thanks!




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41432250
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTypesConverter.scala ---
    @@ -1,160 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.execution.datasources.parquet
    -
    -import java.io.IOException
    -import java.util.{Collections, Arrays}
    -
    -import scala.util.Try
    -
    -import org.apache.hadoop.conf.Configuration
    -import org.apache.hadoop.fs.{FileSystem, Path}
    -import org.apache.hadoop.mapreduce.Job
    -import org.apache.parquet.format.converter.ParquetMetadataConverter
    -import org.apache.parquet.hadoop.metadata.{FileMetaData, ParquetMetadata}
    -import org.apache.parquet.hadoop.util.ContextUtil
    -import org.apache.parquet.hadoop.{Footer, ParquetFileReader, ParquetFileWriter}
    -import org.apache.parquet.schema.MessageType
    -
    -import org.apache.spark.Logging
    -import org.apache.spark.sql.catalyst.expressions.Attribute
    -import org.apache.spark.sql.types._
    -
    -
    -private[parquet] object ParquetTypesConverter extends Logging {
    -  def isPrimitiveType(ctype: DataType): Boolean = ctype match {
    -    case _: NumericType | BooleanType | DateType | TimestampType | StringType | BinaryType => true
    -    case _ => false
    -  }
    -
    -  /**
    -   * Compute the FIXED_LEN_BYTE_ARRAY length needed to represent a given DECIMAL precision.
    -   */
    -  private[parquet] val BYTES_FOR_PRECISION = Array.tabulate[Int](38) { precision =>
    -    var length = 1
    -    while (math.pow(2.0, 8 * length - 1) < math.pow(10.0, precision)) {
    -      length += 1
    -    }
    -    length
    -  }
    -
    -  def convertFromAttributes(attributes: Seq[Attribute]): MessageType = {
    -    val converter = new CatalystSchemaConverter()
    -    converter.convert(StructType.fromAttributes(attributes))
    -  }
    -
    -  def convertFromString(string: String): Seq[Attribute] = {
    -    Try(DataType.fromJson(string)).getOrElse(DataType.fromCaseClassString(string)) match {
    -      case s: StructType => s.toAttributes
    -      case other => sys.error(s"Can convert $string to row")
    -    }
    -  }
    -
    -  def convertToString(schema: Seq[Attribute]): String = {
    -    schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
    -    StructType.fromAttributes(schema).json
    -  }
    -
    -  def writeMetaData(attributes: Seq[Attribute], origPath: Path, conf: Configuration): Unit = {
    -    if (origPath == null) {
    -      throw new IllegalArgumentException("Unable to write Parquet metadata: path is null")
    -    }
    -    val fs = origPath.getFileSystem(conf)
    -    if (fs == null) {
    -      throw new IllegalArgumentException(
    -        s"Unable to write Parquet metadata: path $origPath is incorrectly formatted")
    -    }
    -    val path = origPath.makeQualified(fs)
    -    if (fs.exists(path) && !fs.getFileStatus(path).isDir) {
    -      throw new IllegalArgumentException(s"Expected to write to directory $path but found file")
    -    }
    -    val metadataPath = new Path(path, ParquetFileWriter.PARQUET_METADATA_FILE)
    -    if (fs.exists(metadataPath)) {
    -      try {
    -        fs.delete(metadataPath, true)
    -      } catch {
    -        case e: IOException =>
    -          throw new IOException(s"Unable to delete previous PARQUET_METADATA_FILE at $metadataPath")
    -      }
    -    }
    -    val extraMetadata = new java.util.HashMap[String, String]()
    -    extraMetadata.put(
    -      CatalystReadSupport.SPARK_METADATA_KEY,
    -      ParquetTypesConverter.convertToString(attributes))
    -    // TODO: add extra data, e.g., table name, date, etc.?
    -
    -    val parquetSchema: MessageType = ParquetTypesConverter.convertFromAttributes(attributes)
    -    val metaData: FileMetaData = new FileMetaData(
    -      parquetSchema,
    -      extraMetadata,
    -      "Spark")
    -
    -    ParquetFileWriter.writeMetadataFile(
    -      conf,
    -      path,
    -      Arrays.asList(new Footer(path, new ParquetMetadata(metaData, Collections.emptyList()))))
    -  }
    -
    -  /**
    -   * Try to read Parquet metadata at the given Path. We first see if there is a summary file
    -   * in the parent directory. If so, this is used. Else we read the actual footer at the given
    -   * location.
    -   * @param origPath The path at which we expect one (or more) Parquet files.
    -   * @param configuration The Hadoop configuration to use.
    -   * @return The `ParquetMetadata` containing among other things the schema.
    -   */
    -  def readMetaData(origPath: Path, configuration: Option[Configuration]): ParquetMetadata = {
    -    if (origPath == null) {
    -      throw new IllegalArgumentException("Unable to read Parquet metadata: path is null")
    -    }
    -    val job = new Job()
    -    val conf = {
    -      // scalastyle:off jobcontext
    -      configuration.getOrElse(ContextUtil.getConfiguration(job))
    -      // scalastyle:on jobcontext
    -    }
    -    val fs: FileSystem = origPath.getFileSystem(conf)
    -    if (fs == null) {
    -      throw new IllegalArgumentException(s"Incorrectly formatted Parquet metadata path $origPath")
    -    }
    -    val path = origPath.makeQualified(fs)
    -
    -    val children =
    -      fs
    -        .globStatus(path)
    -        .flatMap { status => if (status.isDir) fs.listStatus(status.getPath) else List(status) }
    -        .filterNot { status =>
    -          val name = status.getPath.getName
    -          (name(0) == '.' || name(0) == '_') && name != ParquetFileWriter.PARQUET_METADATA_FILE
    -        }
    -
    -    // NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
    -    // groups. Since Parquet schema is replicated among all row groups, we only need to touch a
    -    // single row group to read schema related metadata. Notice that we are making assumptions that
    -    // all data in a single Parquet file have the same schema, which is normally true.
    -    children
    -      // Try any non-"_metadata" file first...
    -      .find(_.getPath.getName != ParquetFileWriter.PARQUET_METADATA_FILE)
    -      // ... and fallback to "_metadata" if no such file exists (which implies the Parquet file is
    -      // empty, thus normally the "_metadata" file is expected to be fairly small).
    -      .orElse(children.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE))
    -      .map(ParquetFileReader.readFooter(conf, _, ParquetMetadataConverter.NO_FILTER))
    -      .getOrElse(
    -        throw new IllegalArgumentException(s"Could not find Parquet metadata at path $path"))
    -  }
    -}
    --- End diff --
    
    This file was once where the old Parquet-Catalyst type conversion code lived, but that part was already removed when we migrated to `CatalystSchemaConverter`. The rest of this file was only used by the old Parquet write path, and now we can finally get rid of it.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146675183
  
      [Test build #43417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43417/console) for   PR 8988 at commit [`fb6ee9f`](https://github.com/apache/spark/commit/fb6ee9fc2d39688dcbdc55398013324ede708848).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41421929
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala ---
    @@ -271,4 +276,30 @@ private[parquet] object CatalystReadSupport {
             .getOrElse(toParquet.convertField(f))
         }
       }
    +
    +  def expandUDT(schema: StructType): StructType = {
    +    def expand(dataType: DataType): DataType = {
    +      dataType match {
    +        case t: ArrayType =>
    +          t.copy(elementType = expand(t.elementType))
    +
    +        case t: MapType =>
    +          t.copy(
    +            keyType = expand(t.keyType),
    +            valueType = expand(t.valueType))
    +
    +        case t: StructType =>
    +          val expandedFields = t.fields.map(f => f.copy(dataType = expand(f.dataType)))
    +          t.copy(fields = expandedFields)
    +
    +        case t: UserDefinedType[_] =>
    +          t.sqlType
    +
    +        case t =>
    +          t
    +      }
    +    }
    +
    +    expand(schema).asInstanceOf[StructType]
    +  }
    --- End diff --
    
    Maybe not.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41300287
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala ---
    @@ -271,4 +276,30 @@ private[parquet] object CatalystReadSupport {
             .getOrElse(toParquet.convertField(f))
         }
       }
    +
    +  def expandUDT(schema: StructType): StructType = {
    +    def expand(dataType: DataType): DataType = {
    +      dataType match {
    +        case t: ArrayType =>
    +          t.copy(elementType = expand(t.elementType))
    +
    +        case t: MapType =>
    +          t.copy(
    +            keyType = expand(t.keyType),
    +            valueType = expand(t.valueType))
    +
    +        case t: StructType =>
    +          val expandedFields = t.fields.map(f => f.copy(dataType = expand(f.dataType)))
    +          t.copy(fields = expandedFields)
    +
    +        case t: UserDefinedType[_] =>
    +          t.sqlType
    +
    +        case t =>
    +          t
    +      }
    +    }
    +
    +    expand(schema).asInstanceOf[StructType]
    +  }
    --- End diff --
    
    Not sure whether this method is useful enough to be added as a method on all complex data types.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145995572
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146379997
  
      [Test build #43365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43365/consoleFull) for   PR 8988 at commit [`c542ae9`](https://github.com/apache/spark/commit/c542ae93eef917f310a3331a0bdf90cb6260db4f).



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145968947
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146708774
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146393688
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146365583
  
    Merged build started.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145767542
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146391252
  
      [Test build #43361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43361/console) for   PR 8988 at commit [`2bc5ebc`](https://github.com/apache/spark/commit/2bc5ebcf8c473817af68b056f772eb77a48acb07).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146370620
  
      [Test build #43360 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43360/consoleFull) for   PR 8988 at commit [`f03ef93`](https://github.com/apache/spark/commit/f03ef93e06c1241c69b49dd89d0b155b7ef87019).



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146432002
  
    Merged build started.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145949037
  
    Merged build started.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146431992
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145961804
  
    Merged build started.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145767439
  
      [Test build #43272 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43272/console) for   PR 8988 at commit [`6fd20f7`](https://github.com/apache/spark/commit/6fd20f70baa535b1772c1c30a3f651ea673560f2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146681921
  
    The last build failure was caused by #8983, which broke master and has just been reverted.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146369981
  
      [Test build #43355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43355/console) for   PR 8988 at commit [`5b08a20`](https://github.com/apache/spark/commit/5b08a20ead9d401232855bb2af54ae54696ef17c).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145951351
  
      [Test build #43284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43284/consoleFull) for   PR 8988 at commit [`d395bfd`](https://github.com/apache/spark/commit/d395bfd0bb73a7522e807e43f80e4307a65001a2).



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146403814
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145962058
  
      [Test build #43288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43288/consoleFull) for   PR 8988 at commit [`d395bfd`](https://github.com/apache/spark/commit/d395bfd0bb73a7522e807e43f80e4307a65001a2).



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146367289
  
      [Test build #43355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43355/consoleFull) for   PR 8988 at commit [`5b08a20`](https://github.com/apache/spark/commit/5b08a20ead9d401232855bb2af54ae54696ef17c).



[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41431958
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
    @@ -0,0 +1,428 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.nio.{ByteBuffer, ByteOrder}
    +import java.util
    +
    +import scala.collection.JavaConverters.mapAsJavaMapConverter
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.parquet.column.ParquetProperties
    +import org.apache.parquet.hadoop.ParquetOutputFormat
    +import org.apache.parquet.hadoop.api.WriteSupport
    +import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
    +import org.apache.parquet.io.api.{Binary, RecordConsumer}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.SQLConf
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{SpecializedGetters, SpecificMutableRow}
    +import org.apache.spark.sql.catalyst.util.DateTimeUtils
    +import org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.{MAX_PRECISION_FOR_INT32, MAX_PRECISION_FOR_INT64, minBytesForPrecision}
    +import org.apache.spark.sql.types._
    +
    +/**
    + * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
    + * messages.  This class can write Parquet data in two modes:
    + *
    + *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
    + *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.5 and prior.
    + *
     + * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The
    + * value of the option is propagated to this class by the `init()` method and its Hadoop
    + * configuration argument.
    + */
    +private[parquet] class CatalystWriteSupport extends WriteSupport[InternalRow] with Logging {
    +  // A `ValueWriter` is responsible for writing a field of an `InternalRow` to the record consumer.
    +  // Here we are using `SpecializedGetters` rather than `InternalRow` so that we can directly access
    +  // data in `ArrayData` without the help of `SpecificMutableRow`.
    +  private type ValueWriter = (SpecializedGetters, Int) => Unit
    +
    +  // Schema of the `InternalRow`s to be written
    +  private var schema: StructType = _
    +
    +  // `ValueWriter`s for all fields of the schema
    +  private var rootFieldWriters: Seq[ValueWriter] = _
    +
    +  // The Parquet `RecordConsumer` to which all `InternalRow`s are written
    +  private var recordConsumer: RecordConsumer = _
    +
    +  // Whether to write data in legacy Parquet format compatible with Spark 1.5 and prior versions
    +  private var writeLegacyParquetFormat: Boolean = _
    +
    +  // Reusable byte array used to write timestamps as Parquet INT96 values
    +  private val timestampBuffer = new Array[Byte](12)
    +
    +  // Reusable byte array used to write decimal values
    +  private val decimalBuffer = new Array[Byte](minBytesForPrecision(DecimalType.MAX_PRECISION))
    +
    +  override def init(configuration: Configuration): WriteContext = {
    +    val schemaString = configuration.get(CatalystWriteSupport.SPARK_ROW_SCHEMA)
    +    this.schema = StructType.fromString(schemaString)
    +    this.writeLegacyParquetFormat = {
    +      // `SQLConf.PARQUET_WRITE_LEGACY_FORMAT` should always be explicitly set in ParquetRelation
    +      assert(configuration.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key) != null)
    +      configuration.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key).toBoolean
    +    }
    +    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter)
    +
    +    val messageType = new CatalystSchemaConverter(configuration).convert(schema)
    +    val metadata = Map(CatalystReadSupport.SPARK_METADATA_KEY -> schemaString).asJava
    +
    +    logInfo(
    +      s"""Initialized Parquet WriteSupport with Catalyst schema:
    +         |${schema.prettyJson}
    +         |and corresponding Parquet message type:
    +         |$messageType
    +       """.stripMargin)
    +
    +    new WriteContext(messageType, metadata)
    +  }
    +
    +  override def prepareForWrite(recordConsumer: RecordConsumer): Unit = {
    +    this.recordConsumer = recordConsumer
    +  }
    +
    +  override def write(row: InternalRow): Unit = {
    +    consumeMessage(writeFields(row, schema, rootFieldWriters))
    +  }
    +
    +  private def writeFields(
    +      row: InternalRow, schema: StructType, fieldWriters: Seq[ValueWriter]): Unit = {
    +    var i = 0
    +    while (i < row.numFields) {
    +      if (!row.isNullAt(i)) {
    +        consumeField(schema(i).name, i) {
    +          fieldWriters(i).apply(row, i)
    +        }
    +      }
    +      i += 1
    +    }
    +  }
    +
    +  private def makeWriter(dataType: DataType): ValueWriter = {
    +    dataType match {
    +      case BooleanType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addBoolean(row.getBoolean(ordinal))
    +
    +      case ByteType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addInteger(row.getByte(ordinal))
    +
    +      case ShortType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addInteger(row.getShort(ordinal))
    +
    +      case IntegerType | DateType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addInteger(row.getInt(ordinal))
    +
    +      case LongType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addLong(row.getLong(ordinal))
    +
    +      case FloatType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addFloat(row.getFloat(ordinal))
    +
    +      case DoubleType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addDouble(row.getDouble(ordinal))
    +
    +      case StringType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addBinary(Binary.fromByteArray(row.getUTF8String(ordinal).getBytes))
    +
    +      case TimestampType =>
    +        (row: SpecializedGetters, ordinal: Int) => {
    +          // TODO Writes `TimestampType` values as `TIMESTAMP_MICROS` once parquet-mr implements it
    +          // Currently we only support timestamps stored as INT96, which is compatible with Hive
    +          // and Impala.  However, INT96 is to be deprecated.  We plan to support `TIMESTAMP_MICROS`
    +          // defined in the parquet-format spec.  But up until writing, the most recent parquet-mr
    +          // version (1.8.1) hasn't implemented it yet.
    +
    +          // NOTE: Starting from Spark 1.5, Spark SQL `TimestampType` only has microsecond
    +          // precision.  Nanosecond parts of timestamp values read from INT96 are simply stripped.
    +          val (julianDay, timeOfDayNanos) = DateTimeUtils.toJulianDay(row.getLong(ordinal))
    +          val buf = ByteBuffer.wrap(timestampBuffer)
    +          buf.order(ByteOrder.LITTLE_ENDIAN).putLong(timeOfDayNanos).putInt(julianDay)
    +          recordConsumer.addBinary(Binary.fromByteArray(timestampBuffer))
    +        }
    +
    +      case BinaryType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addBinary(Binary.fromByteArray(row.getBinary(ordinal)))
    +
    +      case DecimalType.Fixed(precision, scale) =>
    +        makeDecimalWriter(precision, scale)
    +
    +      case t: StructType =>
    +        val fieldWriters = t.map(_.dataType).map(makeWriter)
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          consumeGroup(writeFields(row.getStruct(ordinal, t.length), t, fieldWriters))
    +
    +      case t: ArrayType => makeArrayWriter(t)
    +
    +      case t: MapType => makeMapWriter(t)
    +
    +      case t: UserDefinedType[_] => makeWriter(t.sqlType)
    +
    +      // TODO Adds IntervalType support
    +      case _ => sys.error(s"Unsupported data type $dataType.")
    +    }
    +  }
    +
    +  private def makeDecimalWriter(precision: Int, scale: Int): ValueWriter = {
    +    assert(
    +      precision <= DecimalType.MAX_PRECISION,
    +      s"Decimal precision $precision exceeds max precision ${DecimalType.MAX_PRECISION}")
    +
    +    val numBytes = minBytesForPrecision(precision)
    +
    +    val int32Writer =
    +      (row: SpecializedGetters, ordinal: Int) =>
    +        recordConsumer.addInteger(row.getLong(ordinal).toInt)
    +
    +    val int64Writer =
    +      (row: SpecializedGetters, ordinal: Int) =>
    +        recordConsumer.addLong(row.getLong(ordinal))
    +
    +    val binaryWriterUsingUnscaledLong =
    +      (row: SpecializedGetters, ordinal: Int) => {
    +        // When the precision is low enough (<= 18) to squeeze the decimal value into a `Long`, we
    +        // can build a fixed-length byte array with length `numBytes` using the unscaled `Long`
    +        // value and the `decimalBuffer` for better performance.
    +        val unscaled = row.getDecimal(ordinal, precision, scale).toUnscaledLong
    +        var i = 0
    +        var shift = 8 * (numBytes - 1)
    +
    +        while (i < numBytes) {
    +          decimalBuffer(i) = (unscaled >> shift).toByte
    +          i += 1
    +          shift -= 8
    +        }
    +
    +        recordConsumer.addBinary(Binary.fromByteArray(decimalBuffer, 0, numBytes))
    +      }
    +
    +    val binaryWriterUsingUnscaledBytes =
    +      (row: SpecializedGetters, ordinal: Int) => {
    +        val decimal = row.getDecimal(ordinal, precision, scale)
    +        val bytes = decimal.toJavaBigDecimal.unscaledValue().toByteArray
    +        val fixedLengthBytes = if (bytes.length == numBytes) {
    +          // If the length of the underlying byte array of the unscaled `BigInteger` happens to be
    +          // `numBytes`, just reuse it, so that we don't bother copying it to `decimalBuffer`.
    +          bytes
    +        } else {
    +          // Otherwise, the length must be less than `numBytes`.  In this case we copy contents of
    +          // the underlying bytes with padding sign bytes to `decimalBuffer` to form the result
    +          // fixed-length byte array.
    +          val signByte = if (bytes.head < 0) -1: Byte else 0: Byte
    +          util.Arrays.fill(decimalBuffer, 0, numBytes - bytes.length, signByte)
    +          System.arraycopy(bytes, 0, decimalBuffer, numBytes - bytes.length, bytes.length)
    +          decimalBuffer
    +        }
    +
    +        recordConsumer.addBinary(Binary.fromByteArray(fixedLengthBytes, 0, numBytes))
    +      }
    +
    +    writeLegacyParquetFormat match {
    +      // Standard mode, 1 <= precision <= 9, writes as INT32
    +      case false if precision <= MAX_PRECISION_FOR_INT32 => int32Writer
    +
    +      // Standard mode, 10 <= precision <= 18, writes as INT64
    +      case false if precision <= MAX_PRECISION_FOR_INT64 => int64Writer
    +
    +      // Legacy mode, 1 <= precision <= 18, writes as FIXED_LEN_BYTE_ARRAY
    +      case true if precision <= MAX_PRECISION_FOR_INT64 => binaryWriterUsingUnscaledLong
    +
    +      // Either standard or legacy mode, 19 <= precision <= 38, writes as FIXED_LEN_BYTE_ARRAY
    +      case _ => binaryWriterUsingUnscaledBytes
    +    }
    +  }
    +
    +  def makeArrayWriter(arrayType: ArrayType): ValueWriter = {
    +    val elementWriter = makeWriter(arrayType.elementType)
    +
    +    def threeLevelArrayWriter(repeatedGroupName: String, elementFieldName: String): ValueWriter =
    +      (row: SpecializedGetters, ordinal: Int) => {
    +        val array = row.getArray(ordinal)
    +        consumeGroup {
    +          // Only creates the repeated field if the array is non-empty.
    --- End diff --
    
    Note that this is because Parquet doesn't allow writing empty fields. (But empty groups are OK.) The same applies to similar code below.
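
    For readers following the decimal fast path in the diff above, the unscaled-`Long`-to-fixed-length-bytes trick can be sketched standalone; the helper names below are illustrative, and the second function reproduces the `BigInteger` sign-padding fallback for comparison:

    ```scala
    import java.math.BigInteger

    // Encode an unscaled Long into a fixed-length big-endian byte array of
    // `numBytes` bytes, as the <= 18 precision fast path does.
    def fixedBytesFromLong(unscaled: Long, numBytes: Int): Array[Byte] = {
      val out = new Array[Byte](numBytes)
      var i = 0
      var shift = 8 * (numBytes - 1)
      while (i < numBytes) {
        out(i) = (unscaled >> shift).toByte  // most significant byte first
        i += 1
        shift -= 8
      }
      out
    }

    // Reference path: BigInteger.toByteArray, left-padded with sign bytes
    // up to numBytes, mirroring binaryWriterUsingUnscaledBytes.
    def fixedBytesFromBigInteger(unscaled: BigInteger, numBytes: Int): Array[Byte] = {
      val bytes = unscaled.toByteArray
      if (bytes.length == numBytes) bytes
      else {
        val out = new Array[Byte](numBytes)
        val sign: Byte = if (bytes.head < 0) -1 else 0
        java.util.Arrays.fill(out, 0, numBytes - bytes.length, sign)
        System.arraycopy(bytes, 0, out, numBytes - bytes.length, bytes.length)
        out
      }
    }
    ```

    Both paths produce the same two's-complement fixed-length encoding; the `Long` path just avoids allocating a `BigInteger` per value.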

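    The INT96 timestamp layout used by the `TimestampType` writer in the same diff can likewise be sketched on its own (function name illustrative): 8 bytes of nanos-of-day followed by 4 bytes of Julian day, both little-endian.

    ```scala
    import java.nio.{ByteBuffer, ByteOrder}

    // Pack a Parquet INT96 timestamp: little-endian time-of-day nanos (8 bytes)
    // followed by little-endian Julian day (4 bytes), 12 bytes total.
    def packInt96(julianDay: Int, timeOfDayNanos: Long): Array[Byte] = {
      val buf = new Array[Byte](12)
      ByteBuffer.wrap(buf)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(timeOfDayNanos)
        .putInt(julianDay)
      buf
    }
    ```

    This is the layout Hive and Impala expect when reading INT96 timestamps, which is why the writer keeps it despite INT96 being slated for deprecation.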


[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146708775
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43427/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146642923
  
    LGTM, except for some minor comments. Could you also update the PR description? (1.5 -> 1.4)




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146365511
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146433122
  
      [Test build #43387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43387/consoleFull) for   PR 8988 at commit [`e67d0b1`](https://github.com/apache/spark/commit/e67d0b1082b9a9a0b268875826889c8a945e0109).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146683177
  
      [Test build #43427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43427/consoleFull) for   PR 8988 at commit [`fb6ee9f`](https://github.com/apache/spark/commit/fb6ee9fc2d39688dcbdc55398013324ede708848).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146378008
  
    The last build failure was caused by a flaky artifact downloading failure.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41544714
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
    @@ -0,0 +1,432 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.nio.{ByteBuffer, ByteOrder}
    +import java.util
    +
    +import scala.collection.JavaConverters.mapAsJavaMapConverter
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.parquet.column.ParquetProperties
    +import org.apache.parquet.hadoop.ParquetOutputFormat
    +import org.apache.parquet.hadoop.api.WriteSupport
    +import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
    +import org.apache.parquet.io.api.{Binary, RecordConsumer}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.SQLConf
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
    +import org.apache.spark.sql.catalyst.util.DateTimeUtils
    +import org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.{MAX_PRECISION_FOR_INT32, MAX_PRECISION_FOR_INT64, minBytesForPrecision}
    +import org.apache.spark.sql.types._
    +
    +/**
    + * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
    + * messages.  This class can write Parquet data in two modes:
    + *
    + *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
    + *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
    + *
    + * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The value
    + * of this option is propagated to this class by the `init()` method and its Hadoop configuration
    + * argument.
    + */
    +private[parquet] class CatalystWriteSupport extends WriteSupport[InternalRow] with Logging {
    +  // A `ValueWriter` is responsible for writing a field of an `InternalRow` to the record consumer.
    +  // Here we are using `SpecializedGetters` rather than `InternalRow` so that we can directly access
    +  // data in `ArrayData` without the help of `SpecificMutableRow`.
    +  private type ValueWriter = (SpecializedGetters, Int) => Unit
    +
    +  // Schema of the `InternalRow`s to be written
    +  private var schema: StructType = _
    +
    +  // `ValueWriter`s for all fields of the schema
    +  private var rootFieldWriters: Seq[ValueWriter] = _
    +
    +  // The Parquet `RecordConsumer` to which all `InternalRow`s are written
    +  private var recordConsumer: RecordConsumer = _
    +
    +  // Whether to write data in legacy Parquet format compatible with Spark 1.4 and prior versions
    +  private var writeLegacyParquetFormat: Boolean = _
    +
    +  // Reusable byte array used to write timestamps as Parquet INT96 values
    +  private val timestampBuffer = new Array[Byte](12)
    +
    +  // Reusable byte array used to write decimal values
    +  private val decimalBuffer = new Array[Byte](minBytesForPrecision(DecimalType.MAX_PRECISION))
    +
    +  override def init(configuration: Configuration): WriteContext = {
    +    val schemaString = configuration.get(CatalystWriteSupport.SPARK_ROW_SCHEMA)
    +    this.schema = StructType.fromString(schemaString)
    +    this.writeLegacyParquetFormat = {
    +      // `SQLConf.PARQUET_WRITE_LEGACY_FORMAT` should always be explicitly set in ParquetRelation
    +      assert(configuration.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key) != null)
    +      configuration.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key).toBoolean
    +    }
    +    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter)
    +
    +    val messageType = new CatalystSchemaConverter(configuration).convert(schema)
    +    val metadata = Map(CatalystReadSupport.SPARK_METADATA_KEY -> schemaString).asJava
    +
    +    logInfo(
    +      s"""Initialized Parquet WriteSupport with Catalyst schema:
    +         |${schema.prettyJson}
    +         |and corresponding Parquet message type:
    +         |$messageType
    +       """.stripMargin)
    +
    +    new WriteContext(messageType, metadata)
    +  }
    +
    +  override def prepareForWrite(recordConsumer: RecordConsumer): Unit = {
    +    this.recordConsumer = recordConsumer
    +  }
    +
    +  override def write(row: InternalRow): Unit = {
    +    consumeMessage(writeFields(row, schema, rootFieldWriters))
    --- End diff --
    
    This is a little bit confusing; maybe we could write it like this?
    ```
    consumeMessage {
      writeFields(row, schema, rootFieldWriters)
    }
    ```
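
For context, the `consumeMessage { ... }` call style works when the helper takes a by-name block, so the record-writing code is evaluated between `startMessage()` and `endMessage()`. A minimal, self-contained sketch with toy types (not Spark's actual classes):

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for Parquet's RecordConsumer; it only records message boundaries.
class ToyConsumer {
  val events = ArrayBuffer.empty[String]
  def startMessage(): Unit = events += "startMessage"
  def endMessage(): Unit = events += "endMessage"
}

class ToyWriteSupport(consumer: ToyConsumer) {
  // The by-name parameter (`=> Unit`) is what enables the curly-brace call
  // style: the block runs exactly once, between the start/end calls.
  private def consumeMessage(writeRecord: => Unit): Unit = {
    consumer.startMessage()
    writeRecord
    consumer.endMessage()
  }

  def write(payload: String): Unit = consumeMessage {
    consumer.events += s"field($payload)"
  }
}
```

The design choice here is purely syntactic: a by-name block reads like a control structure, which makes the bracketing of the record between message boundaries obvious at the call site.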




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145961734
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145767547
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43272/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41421815
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala ---
    @@ -271,4 +276,30 @@ private[parquet] object CatalystReadSupport {
             .getOrElse(toParquet.convertField(f))
         }
       }
    +
    +  def expandUDT(schema: StructType): StructType = {
    +    def expand(dataType: DataType): DataType = {
    +      dataType match {
    +        case t: ArrayType =>
    +          t.copy(elementType = expand(t.elementType))
    +
    +        case t: MapType =>
    +          t.copy(
    +            keyType = expand(t.keyType),
    +            valueType = expand(t.valueType))
    +
    +        case t: StructType =>
    +          val expandedFields = t.fields.map(f => f.copy(dataType = expand(f.dataType)))
    +          t.copy(fields = expandedFields)
    +
    +        case t: UserDefinedType[_] =>
    +          t.sqlType
    --- End diff --
    
    expand(t.sqlType)
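
The point of recursing with `expand(t.sqlType)` is that a UDT's `sqlType` may itself contain nested UDTs, which `t.sqlType` alone would leave unexpanded. A toy model of the fix (the small ADT below mimics, but is not, Spark's `DataType` hierarchy):

```scala
// Toy data type ADT standing in for Spark's DataType hierarchy.
sealed trait DT
case object IntT extends DT
case class ArrayT(element: DT) extends DT
case class StructT(fields: List[DT]) extends DT
case class UDT(sqlType: DT) extends DT

// Recursively replaces every UDT with its underlying SQL type.
def expand(dt: DT): DT = dt match {
  case ArrayT(e)   => ArrayT(expand(e))
  case StructT(fs) => StructT(fs.map(expand))
  case UDT(sql)    => expand(sql) // recurse: sqlType may contain UDTs too
  case other       => other
}
```

For example, `expand(UDT(ArrayT(UDT(IntT))))` fully reduces to `ArrayT(IntT)`, whereas returning `sqlType` without recursing would leave the inner `UDT` intact.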




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145969709
  
      [Test build #43291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43291/consoleFull) for   PR 8988 at commit [`a680f09`](https://github.com/apache/spark/commit/a680f09320aabd65f64fe071ed6b697862500eb9).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146393295
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43360/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145949010
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8988




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145745373
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146371108
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146379704
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145692337
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146000181
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43291/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145713432
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145995576
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43288/
    Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146371123
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146651270
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146431213
  
    @davies The last build failure was because Hive only recognizes decimals written as `FIXED_LEN_BYTE_ARRAY`. This might be a good reason for turning legacy mode on by default?
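
    For `FIXED_LEN_BYTE_ARRAY` decimals, the array length is the smallest byte count whose signed two's-complement range covers every unscaled value of the given precision. A self-contained sketch of that computation (the same idea as Spark's `minBytesForPrecision`, though this exact code is illustrative):

```scala
// Smallest n such that a signed n-byte integer holds any unscaled decimal
// value with `precision` digits, i.e. 2^(8n - 1) - 1 >= 10^precision - 1.
def minBytesForPrecision(precision: Int): Int = {
  val maxUnscaled = BigInt(10).pow(precision) - 1
  var numBytes = 1
  while (BigInt(2).pow(8 * numBytes - 1) - 1 < maxUnscaled) {
    numBytes += 1
  }
  numBytes
}
```

    So precision 9 fits in 4 bytes (an INT32), precision 18 in 8 bytes (an INT64), and the maximum precision 38 needs 16 bytes; in legacy mode all of these are emitted as fixed-length byte arrays, which is what Hive expects.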




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146372265
  
      [Test build #43361 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43361/consoleFull) for   PR 8988 at commit [`2bc5ebc`](https://github.com/apache/spark/commit/2bc5ebcf8c473817af68b056f772eb77a48acb07).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146370097
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43355/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146675256
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43417/
    Test FAILed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41547505
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
    @@ -99,16 +99,18 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
         withSQLConf(SQLConf.PARQUET_BINARY_AS_STRING.key -> "true")(checkParquetFile(data))
       }
     
    -  test("fixed-length decimals") {
    -    def makeDecimalRDD(decimal: DecimalType): DataFrame =
    -      sparkContext
    -        .parallelize(0 to 1000)
    -        .map(i => Tuple1(i / 100.0))
    -        .toDF()
    -        // Parquet doesn't allow column names with spaces, have to add an alias here
    -        .select($"_1" cast decimal as "dec")
    +  testStandardAndLegacyModes("fixed-length decimals") {
    +    def makeDecimalRDD(decimal: DecimalType): DataFrame = {
    +      sqlContext
    +        .range(1000)
    +        // Parquet doesn't allow column names with spaces, have to add an alias here.
    +        // Minus 500 here so that negative decimals are also tested.
    +        .select((('id - 500) / 100.0) cast decimal as 'dec)
    --- End diff --
    
    Never mind, we already specify the precision and scale.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146456998
  
      [Test build #43387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43387/console) for   PR 8988 at commit [`e67d0b1`](https://github.com/apache/spark/commit/e67d0b1082b9a9a0b268875826889c8a945e0109).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145956438
  
      [Test build #43284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43284/console) for   PR 8988 at commit [`d395bfd`](https://github.com/apache/spark/commit/d395bfd0bb73a7522e807e43f80e4307a65001a2).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41214391
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala ---
    @@ -553,16 +557,8 @@ private[parquet] object CatalystSchemaConverter {
         numBytes
       }
     
    -  private val MIN_BYTES_FOR_PRECISION = Array.tabulate[Int](39)(computeMinBytesForPrecision)
    -
       // Returns the minimum number of bytes needed to store a decimal with a given `precision`.
    -  def minBytesForPrecision(precision : Int) : Int = {
    -    if (precision < MIN_BYTES_FOR_PRECISION.length) {
    -      MIN_BYTES_FOR_PRECISION(precision)
    -    } else {
    -      computeMinBytesForPrecision(precision)
    -    }
    -  }
    --- End diff --
    
    This method is not necessary anymore since we don't support unlimited decimal precision now.
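    The removed `minBytesForPrecision` helper simply cached `computeMinBytesForPrecision`, which finds the smallest byte width whose signed two's-complement range covers every unscaled value of a given precision.  A minimal Python sketch of that calculation, for illustration only (the real helper lives in `CatalystSchemaConverter` and is written in Scala):
    
    ```python
    def compute_min_bytes_for_precision(precision: int) -> int:
        """Smallest n such that a signed n-byte integer can hold any
        unscaled decimal of the given precision, i.e. the smallest n
        with 2**(8*n - 1) - 1 >= 10**precision - 1."""
        num_bytes = 1
        while 2 ** (8 * num_bytes - 1) - 1 < 10 ** precision - 1:
            num_bytes += 1
        return num_bytes
    ```
    
    For example, precision 9 needs 4 bytes, precision 18 needs 8, and the maximum precision 38 needs 16 -- which is why the reusable `decimalBuffer` is sized with `minBytesForPrecision(DecimalType.MAX_PRECISION)`.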




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41547696
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
    @@ -176,6 +178,28 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
         }
       }
     
    +  test("read standard Parquet file under legacy mode") {
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> "false") {
    +        sqlContext
    +          .range(4)
    +          .selectExpr("NAMED_STRUCT('a', id, 'b', ARRAY(id, id + 1, id + 2)) AS s")
    +          .write
    +          .parquet(path)
    +      }
    +
    +      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> "true") {
    --- End diff --
    
    This config will not affect the reading path, what's the purpose of this test? I think it's already covered by `testStandardAndLegacyModes`




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146370352
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146000179
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146370359
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145968972
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145693262
  
      [Test build #43258 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43258/consoleFull) for   PR 8988 at commit [`d1583f8`](https://github.com/apache/spark/commit/d1583f88507a32afb509d33313d0cbe02de4897a).




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8988#discussion_r41545090
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
    @@ -0,0 +1,432 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.nio.{ByteBuffer, ByteOrder}
    +import java.util
    +
    +import scala.collection.JavaConverters.mapAsJavaMapConverter
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.parquet.column.ParquetProperties
    +import org.apache.parquet.hadoop.ParquetOutputFormat
    +import org.apache.parquet.hadoop.api.WriteSupport
    +import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
    +import org.apache.parquet.io.api.{Binary, RecordConsumer}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.SQLConf
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
    +import org.apache.spark.sql.catalyst.util.DateTimeUtils
    +import org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.{MAX_PRECISION_FOR_INT32, MAX_PRECISION_FOR_INT64, minBytesForPrecision}
    +import org.apache.spark.sql.types._
    +
    +/**
    + * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
    + * messages.  This class can write Parquet data in two modes:
    + *
    + *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
    + *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
    + *
    + * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The value
    + * of this option is propagated to this class by the `init()` method and its Hadoop configuration
    + * argument.
    + */
    +private[parquet] class CatalystWriteSupport extends WriteSupport[InternalRow] with Logging {
    +  // A `ValueWriter` is responsible for writing a field of an `InternalRow` to the record consumer.
    +  // Here we are using `SpecializedGetters` rather than `InternalRow` so that we can directly access
    +  // data in `ArrayData` without the help of `SpecificMutableRow`.
    +  private type ValueWriter = (SpecializedGetters, Int) => Unit
    +
    +  // Schema of the `InternalRow`s to be written
    +  private var schema: StructType = _
    +
    +  // `ValueWriter`s for all fields of the schema
    +  private var rootFieldWriters: Seq[ValueWriter] = _
    +
    +  // The Parquet `RecordConsumer` to which all `InternalRow`s are written
    +  private var recordConsumer: RecordConsumer = _
    +
    +  // Whether to write data in legacy Parquet format compatible with Spark 1.4 and prior versions
    +  private var writeLegacyParquetFormat: Boolean = _
    +
    +  // Reusable byte array used to write timestamps as Parquet INT96 values
    +  private val timestampBuffer = new Array[Byte](12)
    +
    +  // Reusable byte array used to write decimal values
    +  private val decimalBuffer = new Array[Byte](minBytesForPrecision(DecimalType.MAX_PRECISION))
    +
    +  override def init(configuration: Configuration): WriteContext = {
    +    val schemaString = configuration.get(CatalystWriteSupport.SPARK_ROW_SCHEMA)
    +    this.schema = StructType.fromString(schemaString)
    +    this.writeLegacyParquetFormat = {
    +      // `SQLConf.PARQUET_WRITE_LEGACY_FORMAT` should always be explicitly set in ParquetRelation
    +      assert(configuration.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key) != null)
    +      configuration.get(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key).toBoolean
    +    }
    +    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter)
    +
    +    val messageType = new CatalystSchemaConverter(configuration).convert(schema)
    +    val metadata = Map(CatalystReadSupport.SPARK_METADATA_KEY -> schemaString).asJava
    +
    +    logInfo(
    +      s"""Initialized Parquet WriteSupport with Catalyst schema:
    +         |${schema.prettyJson}
    +         |and corresponding Parquet message type:
    +         |$messageType
    +       """.stripMargin)
    +
    +    new WriteContext(messageType, metadata)
    +  }
    +
    +  override def prepareForWrite(recordConsumer: RecordConsumer): Unit = {
    +    this.recordConsumer = recordConsumer
    +  }
    +
    +  override def write(row: InternalRow): Unit = {
    +    consumeMessage(writeFields(row, schema, rootFieldWriters))
    +  }
    +
    +  private def writeFields(
    +      row: InternalRow, schema: StructType, fieldWriters: Seq[ValueWriter]): Unit = {
    +    var i = 0
    +    while (i < row.numFields) {
    +      if (!row.isNullAt(i)) {
    +        consumeField(schema(i).name, i) {
    +          fieldWriters(i).apply(row, i)
    +        }
    +      }
    +      i += 1
    +    }
    +  }
    +
    +  private def makeWriter(dataType: DataType): ValueWriter = {
    +    dataType match {
    +      case BooleanType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addBoolean(row.getBoolean(ordinal))
    +
    +      case ByteType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addInteger(row.getByte(ordinal))
    +
    +      case ShortType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addInteger(row.getShort(ordinal))
    +
    +      case IntegerType | DateType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addInteger(row.getInt(ordinal))
    +
    +      case LongType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addLong(row.getLong(ordinal))
    +
    +      case FloatType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addFloat(row.getFloat(ordinal))
    +
    +      case DoubleType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addDouble(row.getDouble(ordinal))
    +
    +      case StringType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addBinary(Binary.fromByteArray(row.getUTF8String(ordinal).getBytes))
    +
    +      case TimestampType =>
    +        (row: SpecializedGetters, ordinal: Int) => {
    +          // TODO Writes `TimestampType` values as `TIMESTAMP_MICROS` once parquet-mr implements it
    +          // Currently we only support timestamps stored as INT96, which is compatible with Hive
    +          // and Impala.  However, INT96 is to be deprecated.  We plan to support `TIMESTAMP_MICROS`
    +          // defined in the parquet-format spec.  But up until writing, the most recent parquet-mr
    +          // version (1.8.1) hasn't implemented it yet.
    +
    +          // NOTE: Starting from Spark 1.5, Spark SQL `TimestampType` only has microsecond
    +          // precision.  Nanosecond parts of timestamp values read from INT96 are simply stripped.
    +          val (julianDay, timeOfDayNanos) = DateTimeUtils.toJulianDay(row.getLong(ordinal))
    +          val buf = ByteBuffer.wrap(timestampBuffer)
    +          buf.order(ByteOrder.LITTLE_ENDIAN).putLong(timeOfDayNanos).putInt(julianDay)
    +          recordConsumer.addBinary(Binary.fromByteArray(timestampBuffer))
    +        }
    +
    +      case BinaryType =>
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          recordConsumer.addBinary(Binary.fromByteArray(row.getBinary(ordinal)))
    +
    +      case DecimalType.Fixed(precision, scale) =>
    +        makeDecimalWriter(precision, scale)
    +
    +      case t: StructType =>
    +        val fieldWriters = t.map(_.dataType).map(makeWriter)
    +        (row: SpecializedGetters, ordinal: Int) =>
    +          consumeGroup(writeFields(row.getStruct(ordinal, t.length), t, fieldWriters))
    --- End diff --
    
    ```
    consumeGroup {
    }
    ```
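    For reference, the `TimestampType` writer in the diff above packs a microsecond timestamp into a 12-byte INT96 value: 8 little-endian bytes of nanoseconds-of-day followed by 4 little-endian bytes of Julian day.  A hedged Python sketch of that packing (the helper names here are illustrative, not Spark APIs; the real splitting is done by `DateTimeUtils.toJulianDay`):
    
    ```python
    import struct
    
    JULIAN_DAY_OF_EPOCH = 2440588      # Julian day number of 1970-01-01
    MICROS_PER_DAY = 86_400_000_000
    
    def to_julian_day(micros_since_epoch):
        # Split a microsecond timestamp into (Julian day, nanos within day).
        julian_us = micros_since_epoch + JULIAN_DAY_OF_EPOCH * MICROS_PER_DAY
        day, micros_of_day = divmod(julian_us, MICROS_PER_DAY)
        return day, micros_of_day * 1000
    
    def pack_int96(micros_since_epoch):
        day, nanos = to_julian_day(micros_since_epoch)
        # INT96 layout used by Hive/Impala: 8-byte little-endian
        # nanos-of-day, then 4-byte little-endian Julian day.
        return struct.pack("<qi", nanos, day)
    ```
    
    Note the microsecond input: nanosecond precision cannot round-trip through Spark 1.5+'s `TimestampType`, matching the comment in the diff about stripped nanosecond parts.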




[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-146685643
  
    A note about interoperability:
    
    Hive 1.2.1 can read Parquet arrays and maps written in standard format. However, it still doesn't recognize Parquet decimals stored as `INT32` or `INT64` ([HIVE-12069] [1]).  There are two options to work around this issue:
    
    1.  Always turn on legacy mode if the written Parquet files are supposed to be used together with Hive.
    
        Legacy mode is turned off by default in this PR.
    
    2.  Add a separate SQL option `spark.sql.parquet.writeCompactDecimal` to indicate whether decimals can be written as `INT32` and `INT64`.
    
        This PR hasn't implemented this option yet.  If we prefer this approach, I can do it in another PR.  We probably want this option to be `false` by default.
    
    I'd vote for 2.
    
    @davies @marmbrus @rxin Thoughts?
    
    [1]: https://issues.apache.org/jira/browse/HIVE-12069
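    Whether a decimal can be written compactly depends only on its precision: every unscaled value of precision <= 9 fits in a signed 32-bit integer, and precision <= 18 fits in a signed 64-bit integer; anything wider needs `FIXED_LEN_BYTE_ARRAY`.  An illustrative Python sketch of that decision (constant names mirror `MAX_PRECISION_FOR_INT32`/`MAX_PRECISION_FOR_INT64` in `CatalystSchemaConverter`; the function name is hypothetical):
    
    ```python
    MAX_PRECISION_FOR_INT32 = 9    # 10**9  - 1 <= 2**31 - 1
    MAX_PRECISION_FOR_INT64 = 18   # 10**18 - 1 <= 2**63 - 1
    
    def compact_parquet_type_for_decimal(precision: int) -> str:
        # Pick the narrowest physical type able to hold every unscaled
        # value of this precision, per the parquet-format DECIMAL spec.
        if precision <= MAX_PRECISION_FOR_INT32:
            return "INT32"
        elif precision <= MAX_PRECISION_FOR_INT64:
            return "INT64"
        else:
            return "FIXED_LEN_BYTE_ARRAY"
    ```
    
    The proposed `spark.sql.parquet.writeCompactDecimal` option would effectively gate the first two branches, falling back to `FIXED_LEN_BYTE_ARRAY` for Hive compatibility when disabled.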





[GitHub] spark pull request: [SPARK-8848] [SQL] Refactors Parquet write pat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8988#issuecomment-145956504
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43284/
    Test FAILed.

