Posted to issues@spark.apache.org by "Asif (Jira)" <ji...@apache.org> on 2023/04/12 22:12:00 UTC
[jira] [Created] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
Asif created SPARK-43112:
----------------------------
Summary: Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
Key: SPARK-43112
URL: https://issues.apache.org/jira/browse/SPARK-43112
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.1
Reporter: Asif
The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as
// The partition column should always appear after data columns.
override def output: Seq[AttributeReference] = dataCols ++ partitionCols
But the data-writing commands of Spark, like InsertIntoHiveDirCommand, expect that the output from HiveTableRelation is in the order in which the columns are actually defined in the DDL.
As a result, multiple mismatch scenarios can occur, such as:
1) A data type casting exception being thrown, even though the DataFrame being inserted has a schema identical to the one used in the CREATE TABLE DDL.
OR
2) The wrong column being used for partitioning, if the data types are the same or castable (for example DateType and LongType).
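The mismatch can be illustrated with a minimal self-contained sketch. The names below (Attr, ddlOrder, etc.) are hypothetical stand-ins, not Spark's real classes; the sketch only models the reordering that `dataCols ++ partitionCols` performs when a partition column is declared in the middle of the DDL column list:

```scala
// Hypothetical model of the reordering; not Spark's actual API.
case class Attr(name: String, dataType: String)

// DDL order, as in:
//   CREATE TABLE t (c1 INT, p STRING, c2 BIGINT) USING hive PARTITIONED BY (p)
val ddlOrder = Seq(Attr("c1", "int"), Attr("p", "string"), Attr("c2", "bigint"))

val partitionCols = ddlOrder.filter(_.name == "p")
val dataCols      = ddlOrder.filterNot(_.name == "p")

// HiveTableRelation.output moves partition columns after data columns:
val relationOutput = dataCols ++ partitionCols   // c1, c2, p

// A writer that binds an incoming row (which follows DDL order) to
// relationOutput positionally pairs c2's values with p, and vice versa:
val row = Seq(1, "2023-04-12", 42L)              // matches DDL order: c1, p, c2
val positionalBinding = relationOutput.map(_.name).zip(row)
// ("c1", 1), ("c2", "2023-04-12"), ("p", 42L)  <- p silently gets c2's value
```

If the swapped types are incompatible this surfaces as a cast exception; if they are the same or castable, the insert succeeds but partitions the data by the wrong column, matching the two scenarios above.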
Will be creating a PR with a test reproducing the bug.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org