Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2022/01/06 19:03:10 UTC

Fwd: metastore bug when hive update spark table ?

From my experience this is a Spark issue (the Spark SQL code base has
diverged from Hive), but of course there is the workaround below.



---------- Forwarded message ---------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Thu, 6 Jan 2022 at 17:29
Subject: Re: metastore bug when hive update spark table ?
To: user <us...@hive.apache.org>


Well, I have seen this type of error before.

I tend to create the table in Hive first and alter it in Spark if needed.
This is Spark 3.1.1 with Hive 3.1.1:

0: jdbc:hive2://rhes75:10099/default> create table my_table2 (col1 int, col2 int);
0: jdbc:hive2://rhes75:10099/default> describe my_table2;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| col1      | int        |          |
| col2      | int        |          |
+-----------+------------+----------+
2 rows selected (0.17 seconds)

In Spark:

>>> spark.sql("""ALTER TABLE my_table2 ADD column col3 string""")
DataFrame[]
>>> for c in spark.sql("""describe formatted my_table2 """).collect():
...   print(c)
...
Row(col_name='col1', data_type='int', comment=None)
Row(col_name='col2', data_type='int', comment=None)
Row(col_name='col3', data_type='string', comment=None)
Row(col_name='', data_type='', comment='')
Row(col_name='# Detailed Table Information', data_type='', comment='')
Row(col_name='Database', data_type='default', comment='')
Row(col_name='Table', data_type='my_table2', comment='')
Row(col_name='Owner', data_type='hduser', comment='')
Row(col_name='Created Time', data_type='Thu Jan 06 17:16:37 GMT 2022', comment='')
Row(col_name='Last Access', data_type='UNKNOWN', comment='')
Row(col_name='Created By', data_type='Spark 2.2 or prior', comment='')
Row(col_name='Type', data_type='MANAGED', comment='')
Row(col_name='Provider', data_type='hive', comment='')
Row(col_name='Table Properties', data_type='[bucketing_version=2, transient_lastDdlTime=1641489641]', comment='')
Row(col_name='Location', data_type='hdfs://rhes75:9000/user/hive/warehouse/my_table2', comment='')
Row(col_name='Serde Library', data_type='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe', comment='')
Row(col_name='InputFormat', data_type='org.apache.hadoop.mapred.TextInputFormat', comment='')
Row(col_name='OutputFormat', data_type='org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat', comment='')
Row(col_name='Storage Properties', data_type='[serialization.format=1]', comment='')
Row(col_name='Partition Provider', data_type='Catalog', comment='')


This is my workaround.
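
A quick way to confirm that Spark and the metastore agree after the ALTER is
to list the columns through the catalog API (a minimal check, assuming the
same session as above; the expected output matches the describe output):

>>> [c.name for c in spark.catalog.listColumns("my_table2")]
['col1', 'col2', 'col3']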

HTH

View my LinkedIn profile:
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/



Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Jan 2022 at 16:17, Nicolas Paris <ni...@riseup.net> wrote:

> Hi there.
>
> I also posted this problem to the Spark list. I am not sure whether this
> is a Spark or a Hive metastore problem, or whether there is some metastore
> tuning configuration that works around it.
>
>
> Spark can't see Hive schema updates, partly because it stores its own
> serialized copy of the schema in the metastore table properties (see step
> 4 below).
>
>
> 1. FROM SPARK: create a table
> ============
> >>> spark.sql("select 1 col1, 2
> col2").write.format("parquet").saveAsTable("my_table")
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
>
>
> 2. FROM HIVE: alter the schema
> ==========
> 0: jdbc:hive2://localhost:10000> ALTER TABLE my_table REPLACE COLUMNS (`col1` int, `col2` int, `col3` string);
> 0: jdbc:hive2://localhost:10000> describe my_table;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | col1      | int        |          |
> | col2      | int        |          |
> | col3      | string     |          |
> +-----------+------------+----------+
>
>
> 3. FROM SPARK: problem, column does not appear
> ==============
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
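>
> (For what it is worth, a plain REFRESH TABLE is unlikely to help here,
> presumably because Spark restores the schema from the serialized copy shown
> in step 4 below rather than from COLUMNS_V2; a minimal check, same session:)
>
> >>> spark.sql("REFRESH TABLE my_table")
> >>> spark.table("my_table").printSchema()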
>
>
> 4. FROM METASTORE DB: two ways of storing the columns
> ======================
> metastore=# select * from "COLUMNS_V2";
> CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
> -------+---------+-------------+-----------+-------------
> 2 | | col1 | int | 0
> 2 | | col2 | int | 1
> 2 | | col3 | string | 2
>
>
> metastore=# select * from "TABLE_PARAMS";
> TBL_ID | PARAM_KEY | PARAM_VALUE
>
>
> --------+-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------
> -------------------------------
> 1 | spark.sql.sources.provider | parquet
> 1 | spark.sql.sources.schema.part.0 |
>
> {"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}},{"name":"col2","type":"integer","n
> ullable":true,"metadata":{}}]}
> 1 | spark.sql.create.version | 2.4.8
> 1 | spark.sql.sources.schema.numParts | 1
> 1 | last_modified_time | 1641483180
> 1 | transient_lastDdlTime | 1641483180
> 1 | last_modified_by | anonymous
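>
> (The same serialized schema can be inspected from Spark without opening the
> metastore database; SHOW TBLPROPERTIES is plain Spark SQL:)
>
> >>> spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)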
>
> metastore=# truncate "TABLE_PARAMS";
> TRUNCATE TABLE
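>
> (Note that truncating TABLE_PARAMS wipes the properties of every table in
> the metastore. A narrower sketch of the same idea, untested here, is to
> unset just the Spark schema properties on the affected table; if Spark
> refuses to modify its reserved spark.sql.* keys, the same UNSET statement
> can be run from beeline instead:)
>
> >>> spark.sql("""ALTER TABLE my_table UNSET TBLPROPERTIES IF EXISTS
> ...     ('spark.sql.sources.schema.numParts', 'spark.sql.sources.schema.part.0')""")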
>
>
> 5. FROM SPARK: now the column magically appears
> ==============
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
> |-- col3: string (nullable = true)
>
>
> So, is it really necessary to store that stuff in TABLE_PARAMS?
>
>
>