You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/09 10:52:35 UTC

[GitHub] [spark] dongjoon-hyun opened a new pull request, #38180: [SPARK-40719][SQL] CTAS should respect TBLPROPERTIES during execution

dongjoon-hyun opened a new pull request, #38180:
URL: https://github.com/apache/spark/pull/38180

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272792441

   I'll convert this to the `Draft` and rethink about the issue. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1274738414

   When we read/write Hive Parquet tables using native Parquet data source, we extract read/write options from table properties:
   https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L132-L138
   
   I don't know the reason of this special case (the code seems to be there for a long time), but this seems like something we can always do: when read/write a data source table, combine table properties and storage properties as the read/write options.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38180:
URL: https://github.com/apache/spark/pull/38180#discussion_r990898535


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala:
##########
@@ -237,7 +237,7 @@ case class CreateDataSourceTableAsSelectCommand(
       className = table.provider.get,
       partitionColumns = table.partitionColumnNames,
       bucketSpec = table.bucketSpec,
-      options = table.storage.properties ++ pathOption,
+      options = table.properties ++ table.storage.properties ++ pathOption,

Review Comment:
   I think the PR title is a bit misleading. We do respect TBLPROPERTIES in CTAS, but we do not include them as DS v1 write options, which should be specified by OPTIONS, or SERDEPROPERTIES.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272783785

   I have two ideas in my mind.
   1. We can allow the same pattern here for ORC and Parquet.
   2. We can introduce a new configuration for a pattern to select table property in a more general way. For example, `orc|parquet` or `'*'`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1397774428

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272751955

   Do you know why `STORED AS` works? AFAIK the storage options should be specified by storage properties, but not table properties.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272603329

   Hi, @cloud-fan . I know that we have `OPTIONS` clause for storage options, but can we allow this too at CTAS?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1275026991

   Thank you, @cloud-fan .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272766758

   @cloud-fan . Apache Spark CTAS with `STORED BY ... TBLPROPERTIES` table has been simply consistent with `CREATE TABLE` + `INSERT INTO` like the following.
   ```
   spark-sql> CREATE TABLE t(a int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'zstd');
   spark-sql> INSERT INTO t VALUES(1);
   ```
   
   ```
   $ ls -al spark-warehouse/t
   total 24
   drwxr-xr-x  6 dongjoon  staff  192 Oct  9 21:13 .
   drwxr-xr-x  3 dongjoon  staff   96 Oct  9 21:13 ..
   -rw-r--r--  1 dongjoon  staff    8 Oct  9 21:13 ._SUCCESS.crc
   -rw-r--r--  1 dongjoon  staff   12 Oct  9 21:13 .part-00000-0a4b4b66-8970-4223-8263-18d35c3eee43-c000.zstd.parquet.crc
   -rw-r--r--  1 dongjoon  staff    0 Oct  9 21:13 _SUCCESS
   -rw-r--r--  1 dongjoon  staff  450 Oct  9 21:13 part-00000-0a4b4b66-8970-4223-8263-18d35c3eee43-c000.zstd.parquet
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272779997

   So the problem is different data sources may interpret table properties differently. Hive serde parquet table respects the parquet options in table properties, but native parquet table does not. If we want to support it, we should clearly define the relationship between table properties, storage properties, and options.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272773175

   Oh, got it. It doesn't work, either.
   ```
   $ ls -al spark-warehouse/t
   total 24
   drwxr-xr-x  6 dongjoon  staff  192 Oct  9 21:25 .
   drwxr-xr-x  3 dongjoon  staff   96 Oct  9 21:24 ..
   -rw-r--r--  1 dongjoon  staff    8 Oct  9 21:25 ._SUCCESS.crc
   -rw-r--r--  1 dongjoon  staff   12 Oct  9 21:25 .part-00000-b7924aff-f367-4043-b3af-0ef9fbadef54-c000.snappy.parquet.crc
   -rw-r--r--  1 dongjoon  staff    0 Oct  9 21:25 _SUCCESS
   -rw-r--r--  1 dongjoon  staff  443 Oct  9 21:25 part-00000-b7924aff-f367-4043-b3af-0ef9fbadef54-c000.snappy.parquet
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272771051

   What's the behavior of `CREATE TABLE t(a int) USING PARQUET TBLPROPERTIES (parquet.compression 'zstd')` + `INSERT INTO t VALUES(1)`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38180:
URL: https://github.com/apache/spark/pull/38180#discussion_r990898884


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala:
##########
@@ -230,6 +230,16 @@ class InMemoryCatalogedDDLSuite extends DDLSuite with SharedSparkSession {
       }
     }
   }
+
+  test("SPARK-40719: CTAS should respect TBLPROPERTIES during query execution") {
+    withTable("t") {
+      withTempDir { dir =>
+        sql(s"CREATE TABLE t USING PARQUET LOCATION '$dir' " +
+          "TBLPROPERTIES (parquet.compression 'zstd') AS SELECT 1")

Review Comment:
   I agree that it's better to be consistent between the Spark native CREATE TABLE syntax and the Hive one, but we probably should revisit the Hive syntax first. If it's reasonable (treat TBLPROPERTIES as write options), we can follow it in the native syntax



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272777252

   Ah, sorry. After looking at the log, I realized that we did and documented this officially.
   
   https://github.com/apache/spark/blob/cf053836e8f75645dab9551683a4e9e32e9625c8/docs/sql-migration-guide.md#L539


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution
URL: https://github.com/apache/spark/pull/38180


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272779405

   According to the migration guide, we fixed `STORED AS ... TBLPROPERTIES` at Apache Spark 2.4.
   - https://github.com/apache/spark/blame/master/docs/sql-migration-guide.md#L539
   ```
   Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38180: [SPARK-40719][SQL] `CTAS` should respect `TBLPROPERTIES` during execution

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38180:
URL: https://github.com/apache/spark/pull/38180#issuecomment-1272781617

   Ya, I agree. According to the current support rule,
   https://github.com/apache/spark/blob/cf053836e8f75645dab9551683a4e9e32e9625c8/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L134-L138
   
   The following is our support rule.
   ```
     private def isOrcProperty(key: String) =
       key.startsWith("orc.") || key.contains(".orc.")
   
     private def isParquetProperty(key: String) =
       key.startsWith("parquet.") || key.contains(".parquet.")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org