Posted to issues@spark.apache.org by "Justin Fincher (JIRA)" <ji...@apache.org> on 2019/07/11 18:54:00 UTC
[jira] [Created] (SPARK-28353) Inserting Into Partitioned Parquet Table Behaves Incorrectly
Justin Fincher created SPARK-28353:
--------------------------------------
Summary: Inserting Into Partitioned Parquet Table Behaves Incorrectly
Key: SPARK-28353
URL: https://issues.apache.org/jira/browse/SPARK-28353
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: Justin Fincher
I hope this is the correct place to log this. Feel free to move as necessary.
We recently encountered an issue when inserting into a partitioned Parquet table. If the columns that the table is partitioned on are not the last in the column list, the inserted values end up under the wrong columns and in the wrong partitions. Here's a simple mockup of what we're seeing.
{code:sql}
CREATE TABLE testtable
USING PARQUET
OPTIONS ('compression'='snappy')
PARTITIONED BY (state, zip3)
AS
SELECT
x.name,
x.state,
x.city,
x.county,
substring(x.zip,1,3) as zip3
FROM originaltable x
WHERE state = 'TN'
CLUSTER BY state, zip3
;
{code}
This creates the table, and the underlying directory structure is partitioned as expected (e.g. /testtable/state=TN/zip3=123/). The problem arises when we try to insert into this table.
{code:sql}
INSERT INTO testtable
SELECT
x.name,
x.state,
x.city,
x.county,
substring(x.zip,1,3) as zip3
FROM originaltable x
WHERE state = 'AL'
CLUSTER BY state, zip3
;
{code}
Instead of one new folder called state=AL, we see that the columns got jumbled, as if the column header for state had moved to the end of the list while the data stayed in place. The result is many new folders named after what appear to be counties (e.g. state=FRANKLIN, state=MARION, etc.).
If the columns are reordered so that the columns on which the table is partitioned come last in the column list, the insert appears to work fine.
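As a sketch of that reordering, the failing insert from the mockup could be rewritten with the partition columns (state, zip3) moved to the end of the select list. The table and column names are the ones used above. Keeping the remaining columns (name, city, county) in their original relative order is an assumption on my part.
{code:sql}
INSERT INTO testtable
SELECT
x.name,
x.city,
x.county,
-- partition columns last, in the same order as PARTITIONED BY (state, zip3)
x.state,
substring(x.zip,1,3) as zip3
FROM originaltable x
WHERE state = 'AL'
CLUSTER BY state, zip3
;
{code}
With this ordering we would expect a single new state=AL folder rather than folders named after counties.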
It would be great if this could be patched to work regardless of column order, but at the very least I believe it should warn or error when the partition columns are not last in the column list, since the current behavior silently inserts incorrect data into the table.