Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/15 09:57:55 UTC
[GitHub] [spark] AngersZhuuuu opened a new pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
AngersZhuuuu opened a new pull request #35528:
URL: https://github.com/apache/spark/pull/35528
### What changes were proposed in this pull request?
Currently, a Spark SQL statement like
```
INSERT OVERWRITE DIRECTORY 'path'
STORED AS PARQUET
query
```
is not converted to `InsertIntoDataSourceCommand` and still uses the Hive SerDe to write data. Because of this, we can't use features provided by newer Parquet/ORC versions, such as zstd compression:
```
spark-sql> INSERT OVERWRITE DIRECTORY 'hdfs://nameservice/user/hive/warehouse/test_zstd_dir'
> stored as parquet
> select 1 as id;
[Stage 5:> (0 + 1) / 1]22/02/15 16:49:31 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, ip-xx-xx-xx-xx, executor 21): org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:203)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:202)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
### Why are the changes needed?
Converting `InsertIntoHiveDirCommand` to `InsertIntoDataSourceCommand` lets us support more Parquet/ORC features.
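As a rough standalone sketch (illustrative code, not the actual Spark source), the eligibility check this PR reuses for `INSERT OVERWRITE DIRECTORY` looks at the Hive serde name and the existing `spark.sql.hive.convertMetastoreParquet` / `convertMetastoreOrc` flags:

```scala
import java.util.Locale

// Illustrative sketch; the real logic lives in RelationConversions
// in sql/hive/.../HiveStrategies.scala.
object ConvertibleSketch {
  // A serde is convertible when it mentions parquet/orc and the
  // corresponding conversion flag is enabled.
  def isConvertible(serdeOpt: Option[String],
                    convertParquet: Boolean,
                    convertOrc: Boolean): Boolean = {
    val serde = serdeOpt.getOrElse("").toLowerCase(Locale.ROOT)
    (serde.contains("parquet") && convertParquet) ||
      (serde.contains("orc") && convertOrc)
  }

  def main(args: Array[String]): Unit = {
    // A Parquet serde with the Parquet flag on is convertible ...
    assert(isConvertible(
      Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
      convertParquet = true, convertOrc = false))
    // ... an ORC serde with only the Parquet flag on is not.
    assert(!isConvertible(
      Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde"),
      convertParquet = true, convertOrc = false))
  }
}
```

When the check passes, the write can go through the built-in data source path instead of the Hive SerDe, which is what makes newer codecs like zstd usable.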
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35528:
URL: https://github.com/apache/spark/pull/35528#discussion_r808838366
##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
##########
@@ -198,11 +198,20 @@ case class RelationConversions(
}
private def isConvertible(tableMeta: CatalogTable): Boolean = {
- val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
+ isConvertible(tableMeta.storage)
+ }
+
+ private def isConvertible(storage: CatalogStorageFormat): Boolean = {
+ val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
}
+ private def convertProvider(storage: CatalogStorageFormat): String = {
+ val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
+ Some("parquet").filter(serde.contains).getOrElse("orc")
Review comment:
nit:
```
if (serde.contains("parquet")) "parquet" else "orc"
```
is much simpler
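For comparison, the two equivalent forms (the Option-based one in the diff and the suggested plain conditional) can be put side by side; this is an illustrative standalone sketch, not the actual Spark source:

```scala
// Illustrative comparison of the two equivalent forms discussed above.
object ConvertProviderSketch {
  // Original form: lift "parquet" into an Option, keep it only if the
  // serde mentions parquet, otherwise fall back to "orc".
  def viaOption(serde: String): String =
    Some("parquet").filter(serde.contains(_)).getOrElse("orc")

  // Suggested simpler form: a plain conditional.
  def viaIf(serde: String): String =
    if (serde.contains("parquet")) "parquet" else "orc"

  def main(args: Array[String]): Unit = {
    val serdes = Seq("parquet.serde.parquethiveserde", "orc.orcserde")
    // Both forms agree on every input.
    serdes.foreach(s => assert(viaOption(s) == viaIf(s)))
  }
}
```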
[GitHub] [spark] AngersZhuuuu commented on pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35528:
URL: https://github.com/apache/spark/pull/35528#issuecomment-1043979668
Gentle ping @cloud-fan, GA passed.
[GitHub] [spark] cloud-fan closed pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #35528:
URL: https://github.com/apache/spark/pull/35528
[GitHub] [spark] AngersZhuuuu commented on pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35528:
URL: https://github.com/apache/spark/pull/35528#issuecomment-1042656806
Also ping @dongjoon-hyun @HyukjinKwon
[GitHub] [spark] cloud-fan commented on pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35528:
URL: https://github.com/apache/spark/pull/35528#issuecomment-1044104566
thanks, merging to master!
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #35528:
URL: https://github.com/apache/spark/pull/35528#discussion_r808842136
##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
##########
@@ -198,11 +198,20 @@ case class RelationConversions(
}
private def isConvertible(tableMeta: CatalogTable): Boolean = {
- val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
+ isConvertible(tableMeta.storage)
+ }
+
+ private def isConvertible(storage: CatalogStorageFormat): Boolean = {
+ val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
}
+ private def convertProvider(storage: CatalogStorageFormat): String = {
+ val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
+ Some("parquet").filter(serde.contains).getOrElse("orc")
Review comment:
> if (serde.contains("parquet")) "parquet" else "orc"
updated
[GitHub] [spark] AngersZhuuuu commented on pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #35528:
URL: https://github.com/apache/spark/pull/35528#issuecomment-1042516622
Gentle ping @cloud-fan @viirya, could you take a look? It's a useful feature.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #35528: [SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible
Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #35528:
URL: https://github.com/apache/spark/pull/35528#discussion_r809629463
##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
##########
@@ -186,6 +186,8 @@ object HiveAnalysis extends Rule[LogicalPlan] {
* - When writing to non-partitioned Hive-serde Parquet/Orc tables
* - When writing to partitioned Hive-serde Parquet/Orc tables when
* `spark.sql.hive.convertInsertingPartitionedTable` is true
+ * - When writing to directory with Hive-serde
+ * - When writing to non-partitioned Hive-serde Parquet/ORC tables using CTAS
Review comment:
@cloud-fan Updated the comment of this rule, and also added a comment about `CTAS`.