Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/10/12 06:46:00 UTC

[jira] [Commented] (SPARK-32281) Spark wipes out SORTED spec in metastore when DESC is used

    [ https://issues.apache.org/jira/browse/SPARK-32281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212156#comment-17212156 ] 

Apache Spark commented on SPARK-32281:
--------------------------------------

User 'AngersZhuuuu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30011

> Spark wipes out SORTED spec in metastore when DESC is used
> ----------------------------------------------------------
>
>                 Key: SPARK-32281
>                 URL: https://issues.apache.org/jira/browse/SPARK-32281
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> When altering a Hive bucketed table or updating its statistics, Spark wipes out the table's SORTED specification in the metastore whenever that specification uses DESC on any sort column.
> For example:
> {noformat}
> 0: jdbc:hive2://localhost:10000> -- in beeline
> 0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.045 seconds)
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> |                   createtab_stmt                   |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`(                           |
> |   `a` int,                                         |
> |   `b` int,                                         |
> |   `c` int,                                         |
> |   `d` int)                                         |
> | CLUSTERED BY (                                     |
> |   c)                                               |
> | SORTED BY (                                        |
> |   c ASC,                                           |
> |   d DESC)                                          |
> | INTO 10 BUCKETS                                    |
> | ROW FORMAT SERDE                                   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT                              |
> |   'org.apache.hadoop.mapred.TextInputFormat'       |
> | OUTPUTFORMAT                                       |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION                                           |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (                                    |
> |   'transient_lastDdlTime'='1594488043')            |
> +----------------------------------------------------+
> 21 rows selected (0.042 seconds)
> 0: jdbc:hive2://localhost:10000> 
> -
> -
> -
> scala> // in spark
> scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
> 20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
> 20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> res0: org.apache.spark.sql.DataFrame = []
> scala> 
> -
> -
> -
> 0: jdbc:hive2://localhost:10000> -- back in beeline
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> |                   createtab_stmt                   |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`(                           |
> |   `a` int,                                         |
> |   `b` int,                                         |
> |   `c` int,                                         |
> |   `d` int)                                         |
> | CLUSTERED BY (                                     |
> |   c)                                               |
> | INTO 10 BUCKETS                                    |
> | ROW FORMAT SERDE                                   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT                              |
> |   'org.apache.hadoop.mapred.TextInputFormat'       |
> | OUTPUTFORMAT                                       |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION                                           |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (                                    |
> |   'foo'='bar',                                     |
> |   'spark.sql.partitionProvider'='catalog',         |
> |   'transient_lastDdlTime'='1594488098')            |
> +----------------------------------------------------+
> 20 rows selected (0.038 seconds)
> 0: jdbc:hive2://localhost:10000> 
> {noformat}
> Note that the SORTED specification disappears.
> Another example, this time using insert:
> {noformat}
> 0: jdbc:hive2://localhost:10000> -- in beeline
> 0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
> No rows affected (0.055 seconds)
> 0: jdbc:hive2://localhost:10000> insert into table bucketed values (0, 1, 2, 3);
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (1.689 seconds)
> 0: jdbc:hive2://localhost:10000> analyze table bucketed compute statistics;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (1.516 seconds)
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> |                   createtab_stmt                   |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`(                           |
> |   `a` int,                                         |
> |   `b` int,                                         |
> |   `c` int,                                         |
> |   `d` int)                                         |
> | CLUSTERED BY (                                     |
> |   c)                                               |
> | SORTED BY (                                        |
> |   c ASC,                                           |
> |   d DESC)                                          |
> | INTO 10 BUCKETS                                    |
> | ROW FORMAT SERDE                                   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT                              |
> |   'org.apache.hadoop.mapred.TextInputFormat'       |
> | OUTPUTFORMAT                                       |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION                                           |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (                                    |
> |   'transient_lastDdlTime'='1594488191')            |
> +----------------------------------------------------+
> 21 rows selected (0.078 seconds)
> 0: jdbc:hive2://localhost:10000> 
> -
> -
> -
> scala> // in spark
> scala> sql("set hive.enforce.sorting=false")
> 20/07/11 10:23:57 WARN SetCommand: 'SET hive.enforce.sorting=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.sorting) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("set hive.enforce.bucketing=false")
> 20/07/11 10:24:01 WARN SetCommand: 'SET hive.enforce.bucketing=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.bucketing) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.range(0,1000).map { x => (x, x + 1, x + 2, x + 3) }.
>   toDF("a", "b", "c", "d").createOrReplaceTempView("df")
>      | 
> scala> 
> scala> sql("insert into bucketed select * from df")
> 20/07/11 10:24:15 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
> 20/07/11 10:24:16 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
> 20/07/11 10:24:16 WARN InsertIntoHiveTable: Output Hive table `default`.`bucketed` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive. Inserting data anyways since both hive.enforce.bucketing and hive.enforce.sorting are set to false.
> 20/07/11 10:24:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> res3: org.apache.spark.sql.DataFrame = []
> scala> 
> -
> -
> -
> 0: jdbc:hive2://localhost:10000> -- back in beeline
> 0: jdbc:hive2://localhost:10000> show create table bucketed;
> +----------------------------------------------------+
> |                   createtab_stmt                   |
> +----------------------------------------------------+
> | CREATE TABLE `bucketed`(                           |
> |   `a` int,                                         |
> |   `b` int,                                         |
> |   `c` int,                                         |
> |   `d` int)                                         |
> | CLUSTERED BY (                                     |
> |   c)                                               |
> | INTO 10 BUCKETS                                    |
> | ROW FORMAT SERDE                                   |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
> | STORED AS INPUTFORMAT                              |
> |   'org.apache.hadoop.mapred.TextInputFormat'       |
> | OUTPUTFORMAT                                       |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
> | LOCATION                                           |
> |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
> | TBLPROPERTIES (                                    |
> |   'transient_lastDdlTime'='1594488259')            |
> +----------------------------------------------------+
> 18 rows selected (0.041 seconds)
> 0: jdbc:hive2://localhost:10000> 
> {noformat}
> Note that the SORTED specification disappears.
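A plausible reading of why the sort direction cannot survive the round trip through Spark (the classes below are simplified stand-ins written for illustration, not Spark's actual `BucketSpec` or Hive's actual `Order`, and this models the suspected mechanism, not the confirmed fix in the linked PR): Hive's metastore represents each sort column as a (name, direction) pair, while Spark's catalog-side bucket spec records only the column names. Anything written back from the Spark side therefore has no DESC to restore:

```scala
// Simplified stand-in for Hive's metastore sort-column entry
// (org.apache.hadoop.hive.metastore.api.Order): name plus direction.
case class HiveOrder(col: String, order: Int) // 1 = ASC, 0 = DESC

// Simplified stand-in for Spark's catalog bucket spec
// (org.apache.spark.sql.catalyst.catalog.BucketSpec): note that
// sortColumnNames carries column names only -- no ASC/DESC flag.
case class SparkBucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])

object RoundTrip {
  // Hive -> Spark: the direction is discarded at this boundary.
  def toSpark(numBuckets: Int,
              bucketCols: Seq[String],
              sortCols: Seq[HiveOrder]): SparkBucketSpec =
    SparkBucketSpec(numBuckets, bucketCols, sortCols.map(_.col))

  // Spark -> Hive: with only names available, any reconstruction must
  // guess a direction (here: everything ASC), so a spec containing
  // DESC can never be reproduced faithfully.
  def toHive(spec: SparkBucketSpec): Seq[HiveOrder] =
    spec.sortColumnNames.map(HiveOrder(_, 1))

  def main(args: Array[String]): Unit = {
    val original     = Seq(HiveOrder("c", 1), HiveOrder("d", 0)) // c ASC, d DESC
    val roundTripped = toHive(toSpark(10, Seq("c"), original))
    println(original == roundTripped) // false: d's DESC is lost
  }
}
```

As the transcripts above show, the practical way to check the metastore's actual state before and after a Spark operation is to run `SHOW CREATE TABLE` (or `DESCRIBE FORMATTED`) from beeline, since Spark's own view of the table never contained the direction to begin with.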



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
