Posted to reviews@spark.apache.org by chenghao-intel <gi...@git.apache.org> on 2015/04/22 10:01:22 UTC

[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/5630

    [SPARK-7051] [SQL] Configuration for parquet data writing

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5630
    
----
commit 62e587f40f7b06127c761152cddab7e75dc83879
Author: Cheng Hao <ha...@intel.com>
Date:   2015-04-22T07:50:54Z

    Parquet Configuration for writing

----




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r28937324
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    I meant that Parquet should have its own conf keys for those, right? We do not need to add Spark SQL ones.
    
    See https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java#L98-L106 
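    For reference, here is a minimal sketch of setting those Parquet-level
    keys directly on a Hadoop `Configuration`. The key names come from
    `ParquetOutputFormat`; the helper name and the particular values are
    made up for illustration only.

        import org.apache.hadoop.conf.Configuration

        // Hypothetical helper: set parquet-mr's own writer properties on the
        // Hadoop Configuration that the write job will use.
        def configureParquetWriter(conf: Configuration): Configuration = {
          conf.set("parquet.block.size", (128 * 1024 * 1024).toString) // row group size in bytes
          conf.set("parquet.page.size", (1024 * 1024).toString)        // page size in bytes
          conf.set("parquet.compression", "GZIP")                      // compression codec
          conf
        }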




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r28892756
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    I think that Parquet has conf keys for these, right? Users can just pass those properties in the `Options` (the `parameters` in `ParquetRelation2`). Then we will need to change the Parquet relation to set those properties correctly.
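    As an illustration, a hedged sketch of passing those keys through the
    data source options with the 1.3-era `DataFrame.save(source, mode,
    options)` overload. The table name and output path are placeholders,
    and whether the options map is actually forwarded to the writer is
    exactly what would need to change in `ParquetRelation2`.

        import org.apache.spark.sql.SaveMode

        // Assumes an existing SQLContext in scope; "src_tbl" and the output
        // path are placeholders for illustration only.
        val df = sqlContext.table("src_tbl")
        df.save(
          "parquet",
          SaveMode.Overwrite,
          Map(
            "path"               -> "/tmp/parquet_out",
            "parquet.block.size" -> (256 * 1024 * 1024).toString,
            "parquet.page.size"  -> (1024 * 1024).toString))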




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r28937128
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    Yes, users have to use the `SET` command in Hive, not Hive's table properties.
    
    http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_parquet.html
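    For example, a minimal sketch of the `SET`-based approach, assuming a
    `HiveContext` named `sqlContext` and placeholder table names; whether
    these values actually reach the Parquet writer is the open question in
    this thread.

        // Set parquet-mr's own properties the Hive way, via SET commands,
        // then write through a Hive-style INSERT.
        sqlContext.sql("SET parquet.compression=GZIP")
        sqlContext.sql("SET parquet.block.size=268435456") // 256 MB row groups
        sqlContext.sql("SET parquet.page.size=1048576")    // 1 MB pages
        sqlContext.sql("INSERT OVERWRITE TABLE parquet_tbl SELECT * FROM src_tbl")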





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r28936699
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    You mean the case where users create a Hive table using `CREATE TABLE ... STORED AS PARQUET`? In that case, the user should use `SET` or Hive's table properties, right?




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5630#issuecomment-95068115
  
      [Test build #30741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30741/consoleFull) for   PR 5630 at commit [`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r29310766
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    @yhuai , I just confirmed: the Parquet files are still written uncompressed even if we set the property via `SET parquet.compression=GZIP`.
    
    From the source code, https://github.com/chenghao-intel/spark/blob/parquet/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L636
    Settings in `SQLConf` are not appended to the Hadoop `Configuration`, which is why none of these settings take effect.
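    To make that concrete, a hedged sketch of copying the relevant
    `SQLConf` entries into the Hadoop `Configuration` before the write job
    starts. The helper name and the key mapping are assumptions for
    illustration, not the actual code in `newParquet.scala`, and value
    normalization (e.g. upper-casing the codec name) is omitted.

        import org.apache.hadoop.conf.Configuration
        import org.apache.spark.sql.SQLContext

        // Hypothetical helper: propagate Spark SQL's Parquet settings into the
        // Hadoop Configuration so that parquet-mr can actually see them.
        def propagateParquetSettings(sqlContext: SQLContext, hadoopConf: Configuration): Unit = {
          val mapping = Map(
            "spark.sql.parquet.compression.codec" -> "parquet.compression",
            "spark.sql.parquet.blocksize"         -> "parquet.block.size",
            "spark.sql.parquet.pagesize"          -> "parquet.page.size")
          for ((sparkKey, parquetKey) <- mapping) {
            val value = sqlContext.getConf(sparkKey, null)
            if (value != null) hadoopConf.set(parquetKey, value) // unset keys keep parquet-mr defaults
          }
        }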




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r28927882
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    @yhuai  I know it will work if we create the table through the external data source API; this change is about keeping it consistent with Hive tables.
    
    Sorry, I should have made that clearer; I've updated the description.




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5630#issuecomment-95100781
  
      [Test build #30741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30741/consoleFull) for   PR 5630 at commit [`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5630#discussion_r28937606
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
    --- End diff --
    
    At least we need to add the `spark.sql.` prefix, right? Like what we did for `spark.sql.parquet.compression.codec`.
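    A minimal standalone sketch that mirrors the `SQLConf` naming style;
    the defaults below are parquet-mr's usual ones and are assumed here for
    illustration, not taken from this patch.

        // Hypothetical stand-in for what would live in org.apache.spark.sql.SQLConf.
        object ParquetWriteConf {
          val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
          val PARQUET_PAGE_SIZE  = "spark.sql.parquet.pagesize"

          // Assumed defaults: 128 MB row groups, 1 MB pages.
          def parquetBlockSize(settings: Map[String, String]): Int =
            settings.getOrElse(PARQUET_BLOCK_SIZE, (128 * 1024 * 1024).toString).toInt
          def parquetPageSize(settings: Map[String, String]): Int =
            settings.getOrElse(PARQUET_PAGE_SIZE, (1024 * 1024).toString).toInt
        }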




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5630#issuecomment-95100822
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30741/
    Test PASSed.




[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5630#issuecomment-96767366
  
      [Test build #30992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30992/consoleFull) for   PR 5630 at commit [`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).

