You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/10/12 06:46:00 UTC
[jira] [Commented] (SPARK-32281) Spark wipes out SORTED spec in
metastore when DESC is used
[ https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212156#comment-17212156 ]
Apache Spark commented on SPARK-32281:
--------------------------------------
User 'AngersZhuuuu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30011
> Spark wipes out SORTED spec in metastore when DESC is used
> ----------------------------------------------------------
>
> Key: SPARK-32281
> URL: https://issues.apache.org/jira/browse/SPARK-32281
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Bruce Robbins
> Priority: Major
>
> When altering a Hive bucketed table or updating its statistics, Spark will wipe out the SORTED specification in the metastore if the specification uses DESC.
> For example:
> {noformat}
> 0: jdbc:hive2://localhost:10000> -- in beeline
> 0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.045 seconds)
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> | createtab_stmt |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`( |
> | `a` int, |
> | `b` int, |
> | `c` int, |
> | `d` int) |
> | CLUSTERED BY ( |
> | c) |
> | SORTED BY ( |
> | c ASC, |
> | d DESC) |
> | INTO 10 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.mapred.TextInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION |
> | 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES ( |
> | 'transient_lastDdlTime'='1594488043') |
> +----------------------------------------------------+
> 21 rows selected (0.042 seconds)
> 0: jdbc:hive2://localhost:10000>
> -
> -
> -
> scala> // in spark
> scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
> 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
> 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> res0: org.apache.spark.sql.DataFrame = []
> scala>
> -
> -
> -
> 0: jdbc:hive2://localhost:10000> -- back in beeline
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> | createtab_stmt |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`( |
> | `a` int, |
> | `b` int, |
> | `c` int, |
> | `d` int) |
> | CLUSTERED BY ( |
> | c) |
> | INTO 10 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.mapred.TextInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION |
> | 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES ( |
> | 'foo'='bar', |
> | 'spark.sql.partitionProvider'='catalog', |
> | 'transient_lastDdlTime'='1594488098') |
> +----------------------------------------------------+
> 20 rows selected (0.038 seconds)
> 0: jdbc:hive2://localhost:10000>
> {noformat}
> Note that the SORTED specification disappears.
> Another example, this time using insert:
> {noformat}
> 0: jdbc:hive2://localhost:10000> -- in beeline
> 0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.055 seconds)
> 0: jdbc:hive2://localhost:10000> insert into table bucketed values (0, 1, 2, 3);
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (1.689 seconds)
> 0: jdbc:hive2://localhost:10000> analyze table bucketed compute statistics;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (1.516 seconds)
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> | createtab_stmt |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`( |
> | `a` int, |
> | `b` int, |
> | `c` int, |
> | `d` int) |
> | CLUSTERED BY ( |
> | c) |
> | SORTED BY ( |
> | c ASC, |
> | d DESC) |
> | INTO 10 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.mapred.TextInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION |
> | 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES ( |
> | 'transient_lastDdlTime'='1594488191') |
> +----------------------------------------------------+
> 21 rows selected (0.078 seconds)
> 0: jdbc:hive2://localhost:10000>
> -
> -
> -
> scala> // in spark
> scala> sql("set hive.enforce.sorting=false")
> 20/07/11 10:23:57 WARN SetCommand: 'SET hive.enforce.sorting=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.sorting) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("set hive.enforce.bucketing=false")
> 20/07/11 10:24:01 WARN SetCommand: 'SET hive.enforce.bucketing=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.bucketing) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.range(0,1000).map { x => (x, x + 1, x + 2, x + 3) }.
> toDF("a", "b", "c", "d").createOrReplaceTempView("df")
> |
> scala>
> scala> sql("insert into bucketed select * from df")
> 20/07/11 10:24:15 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
> 20/07/11 10:24:16 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
> 20/07/11 10:24:16 WARN InsertIntoHiveTable: Output Hive table `default`.`bucketed` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive. Inserting data anyways since both hive.enforce.bucketing and hive.enforce.sorting are set to false.
> 20/07/11 10:24:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> res3: org.apache.spark.sql.DataFrame = []
> scala>
> -
> -
> -
> 0: jdbc:hive2://localhost:10000> -- back in beeline
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> | createtab_stmt |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`( |
> | `a` int, |
> | `b` int, |
> | `c` int, |
> | `d` int) |
> | CLUSTERED BY ( |
> | c) |
> | INTO 10 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.mapred.TextInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION |
> | 'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES ( |
> | 'transient_lastDdlTime'='1594488259') |
> +----------------------------------------------------+
> 18 rows selected (0.041 seconds)
> 0: jdbc:hive2://localhost:10000>
> {noformat}
> Note that the SORTED specification disappears.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org