Posted to issues@spark.apache.org by "Tóth Andor (Jira)" <ji...@apache.org> on 2021/10/08 17:23:00 UTC
[jira] [Created] (SPARK-36959) Renamed columns of parquet tables become NULL

Tóth Andor created SPARK-36959:
----------------------------------

             Summary: Renamed columns of parquet tables become NULL
                 Key: SPARK-36959
                 URL: https://issues.apache.org/jira/browse/SPARK-36959
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.2, 3.0.3
         Environment: Executors run on a 12-node Hadoop YARN cluster (Cloudera CDH5 5.16.2-2).

The driver runs on CentOS 8 (Release) with Python 3.6.8. pyspark is installed with pip into a fresh virtual environment, and SPARK_HOME is set to the virtual environment's pyspark directory (.../venv/lib/python3.6/site-packages/pyspark).

HADOOP_CONF_DIR points to the directory that contains the Hive client configuration.

Spark config:
{noformat}
(spark.jars,)
(spark.app.name,org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver)
(spark.sql.hive.metastore.jars,maven)
(spark.submit.pyFiles,)
(spark.submit.deployMode,client)
(spark.master,yarn)
(spark.sql.hive.metastore.version,1.1.0){noformat}
            Reporter: Tóth Andor


If a column of a Parquet table is renamed in the Hive metastore, Spark SQL can no longer read that column's values: every row returns NULL for the renamed column.

The problem can be reproduced with the following SQL queries:

 
{noformat}
create table tmp.parquet_table1 (i1 int, s1 string) stored as parquet;
insert into tmp.parquet_table1 values (1, "AAA"), (2, "BBB"), (3, "CCC");
select * from tmp.parquet_table1;
+---+---+
| i1| s1|
+---+---+
|  1|AAA|
|  2|BBB|
|  3|CCC|
+---+---+
alter table tmp.parquet_table1 replace columns (i2 int, s1 string);
select * from tmp.parquet_table1;
+----+---+
|  i2| s1|
+----+---+
|null|AAA|
|null|BBB|
|null|CCC|
+----+---+ 
{noformat}
Note that column `i1` was renamed to `i2`, and its values are NULL afterwards.
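
The report does not identify why the values are lost. One possible explanation, stated here purely as an assumption, is that the reader matches the metastore's column names against the names stored in the Parquet file footer. "alter table ... replace columns ..." changes only the metastore schema and leaves the data files untouched, so a lookup for `i2` would find nothing in a file that still holds `i1`. The sketch below is a plain-Python model of that idea. The function names are hypothetical and are not Spark APIs.

```python
# Hypothetical model of name-based vs. position-based column resolution.
# Nothing here is Spark code; it only illustrates the reported symptom.

# Column names as written into the Parquet file footer at insert time.
FILE_SCHEMA = ["i1", "s1"]
FILE_ROWS = [(1, "AAA"), (2, "BBB"), (3, "CCC")]

# Metastore schema after: alter table ... replace columns (i2 int, s1 string)
RENAMED_SCHEMA = ["i2", "s1"]


def read_by_name(table_schema, file_schema, rows):
    """Look up each table column in the file by name; unknown names yield None."""
    index = {name: pos for pos, name in enumerate(file_schema)}
    return [
        tuple(row[index[col]] if col in index else None for col in table_schema)
        for row in rows
    ]


def read_by_position(table_schema, file_schema, rows):
    """Pair the n-th table column with the n-th file column, ignoring names."""
    return [
        tuple(
            row[pos] if pos < len(file_schema) else None
            for pos in range(len(table_schema))
        )
        for row in rows
    ]


by_name = read_by_name(RENAMED_SCHEMA, FILE_SCHEMA, FILE_ROWS)
by_position = read_by_position(RENAMED_SCHEMA, FILE_SCHEMA, FILE_ROWS)

print(by_name)      # the renamed column comes back as None in every row
print(by_position)  # the original values are still readable
```

In this model, name-based lookup reproduces the NULL column from the output above, while position-based lookup still returns 1, 2 and 3. Whether this is what actually differs between the affected Spark versions and 2.4.8 is not established by this report.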

{{I used Impala to create the table and insert the values, but Spark SQL could be used as well. The "alter table ... replace columns ..." statement can only be executed outside of Spark (in Impala or Hive).}}

{{As far as I am aware, no [configuration option|https://spark.apache.org/docs/latest/configuration.html] helps.}}

{{With Spark 2.4.8, this works correctly.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)