Posted to issues@spark.apache.org by "Ashish Sharma (Jira)" <ji...@apache.org> on 2024/01/02 04:10:00 UTC
[jira] [Updated] (SPARK-46477) Hive table (bucketdb.dept_part_buk) is bucketed but partition (location=nk) is not bucketed
[ https://issues.apache.org/jira/browse/SPARK-46477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashish Sharma updated SPARK-46477:
----------------------------------
Description:
Presto fails to read a partition of a Hive table that was updated by Spark SQL, with the following error:
{noformat}
Hive table (bucketdb.dept_part_buk) is bucketed but partition (location=nk) is not bucketed
{noformat}
The Spark SQL statement that causes the read failure in Presto:
{noformat}
ALTER TABLE bucketdb.dept_part_buk PARTITION (location='nk') set LOCATION 'file:///tmp/location=nk1';
{noformat}
*Root Cause*
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "loc";
drops the bucket column information from the HMS (Hive Metastore).
Repro script:
{code:java}
CREATE TABLE bucketdb.dept_part_buk ( deptno INT, dname STRING, location STRING) PARTITIONED BY (location) CLUSTERED BY (deptno) INTO 2 BUCKETS STORED AS textfile location 'file:///tmp';
ALTER TABLE bucketdb.dept_part_buk ADD IF NOT EXISTS PARTITION (location='nk') LOCATION 'file:///tmp/location=nk';
ALTER TABLE bucketdb.dept_part_buk PARTITION (location='nk') set LOCATION 'file:///tmp/location=nk1';
{code}
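The metadata loss can also be observed from Spark itself, without querying the metastore database directly. A possible check (a sketch assuming the repro table above exists; the exact output layout varies by Spark/Hive version):
{code:sql}
-- Before the SET LOCATION statement, the partition detail should report
-- Num Buckets: 2 and a bucket column list containing deptno.
DESCRIBE FORMATTED bucketdb.dept_part_buk PARTITION (location='nk');

-- After running the SET LOCATION statement from the repro script, the same
-- command reports no bucketing for the partition, matching the
-- NUM_BUCKETS/BUCKETING_COLS state captured in the HMS tables.
DESCRIBE FORMATTED bucketdb.dept_part_buk PARTITION (location='nk');
{code}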
*Investigation*
*HMS state before running the query*
{noformat}
mysql> select * from SDS where SD_ID = 137;
+-------+-------+------------------------------------------+------------------------------+------------------------------------------------------+--------------------------------------------+-------------+------------------------------------------------------------+----------+
| SD_ID | CD_ID | INPUT_FORMAT | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION | NUM_BUCKETS | OUTPUT_FORMAT | SERDE_ID |
+-------+-------+------------------------------------------+------------------------------+------------------------------------------------------+--------------------------------------------+-------------+------------------------------------------------------------+----------+
| 137 | 106 | org.apache.hadoop.mapred.TextInputFormat | 0x00 | 0x00 | file:/tmp/location=nk | 2 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | 137 |
+-------+-------+------------------------------------------+------------------------------+------------------------------------------------------+--------------------------------------------+-------------+------------------------------------------------------------+----------+
mysql> select * from BUCKETING_COLS where SD_ID = 137;
+-------+-----------------+-------------+
| SD_ID | BUCKET_COL_NAME | INTEGER_IDX |
+-------+-----------------+-------------+
| 137 | deptno | 0 |
+-------+-----------------+-------------+
{noformat}
*Spark SQL query*
ALTER TABLE bucketdb.dept_part_buk PARTITION (location='nk') set LOCATION 'file:///Users/someuser/sparkdata/location=nk1';
*HMS state after running the Spark SQL query*
{noformat}
mysql> select * from SDS where SD_ID = 137;
+-------+-------+------------------------------------------+------------------------------+------------------------------------------------------+---------------------------------------------+-------------+------------------------------------------------------------+----------+
| SD_ID | CD_ID | INPUT_FORMAT | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION | NUM_BUCKETS | OUTPUT_FORMAT | SERDE_ID |
+-------+-------+------------------------------------------+------------------------------+------------------------------------------------------+---------------------------------------------+-------------+------------------------------------------------------------+----------+
| 137 | 106 | org.apache.hadoop.mapred.TextInputFormat | 0x00 | 0x00 | file:/tmp/location=nk1 | 0 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | 137 |
+-------+-------+------------------------------------------+------------------------------+------------------------------------------------------+---------------------------------------------+-------------+------------------------------------------------------------+----------+
mysql> select * from BUCKETING_COLS where SD_ID = 141;
Empty set (0.00 sec)
{noformat}
*Problem*
1. NUM_BUCKETS is set to 0; it should remain 2.
2. The row with BUCKET_COL_NAME = deptno is deleted from table BUCKETING_COLS.
Because of these two changes, Presto cannot detect the bucketing information for the altered partition on the read path.
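To find other partitions that may already have been affected, a metastore-side query along these lines can help (a sketch against the MySQL-backed HMS used above; the PARTITIONS/TBLS/SDS join columns are assumed from the standard HMS schema, and non-bucketed storage descriptors may carry NUM_BUCKETS of 0 or -1 depending on version):
{code:sql}
-- Partitions whose storage descriptor lost its bucketing while the
-- parent table's own storage descriptor is still bucketed.
SELECT t.TBL_NAME, p.PART_NAME, ps.NUM_BUCKETS
FROM PARTITIONS p
JOIN SDS ps ON p.SD_ID = ps.SD_ID
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
JOIN SDS ts ON t.SD_ID = ts.SD_ID
WHERE ts.NUM_BUCKETS > 0
  AND ps.NUM_BUCKETS <= 0;
{code}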
was: (previous revision of the description, since superseded by the text above)
> Hive table (bucketdb.dept_part_buk) is bucketed but partition (location=nk) is not bucketed
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-46477
> URL: https://issues.apache.org/jira/browse/SPARK-46477
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: Ashish Sharma
> Priority: Minor
> Labels: pull-request-available
>
> (description as above)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org