Posted to issues@spark.apache.org by "Justin Fincher (JIRA)" <ji...@apache.org> on 2019/07/11 18:54:00 UTC

[jira] [Created] (SPARK-28353) Inserting Into Partitioned Parquet Table Behaves Incorrectly

Justin Fincher created SPARK-28353:
--------------------------------------

             Summary: Inserting Into Partitioned Parquet Table Behaves Incorrectly
                 Key: SPARK-28353
                 URL: https://issues.apache.org/jira/browse/SPARK-28353
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Justin Fincher


I hope this is the correct place to log this. Feel free to move as necessary.

We recently encountered an issue when inserting into a partitioned Parquet table. If the columns that the table is partitioned on are not last in the column list, the inserted data ends up in the wrong columns and in the wrong partitions. Here is a simple mockup of what we are seeing.
{code:sql}
CREATE TABLE testtable
USING PARQUET
OPTIONS ('compression'='snappy')
PARTITIONED BY (state, zip3)
AS
SELECT
  x.name,
  x.state,
  x.city,
  x.county,
  substring(x.zip,1,3) as zip3
FROM originaltable x
WHERE state = 'TN'
CLUSTER BY state, zip3
;
{code}
This creates the table, and the underlying directory structure is partitioned as expected (e.g. /testtable/state=TN/zip3=123/). The problem arises when we try to insert into this table.
{code:sql}
INSERT INTO testtable 
SELECT 
  x.name, 
  x.state, 
  x.city, 
  x.county, 
  substring(x.zip,1,3) as zip3 
FROM originaltable x 
WHERE state = 'AL' 
CLUSTER BY state, zip3
;
{code}
Instead of one new folder called state=AL, we see that the columns got jumbled, as if the column header for state in this example had moved to the end of the list but the data had not. The result is many new folders whose names appear to be county names (e.g. state=FRANKLIN, state=MARION, etc.).
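
One way to check whether the stored column order differs from the SELECT list is to inspect the table itself. The statements below are a diagnostic sketch, not a confirmed root cause. They assume that the partition columns are stored after the data columns and that INSERT INTO matches columns by position and not by name.
{code:sql}
-- List the table's columns in their stored order.
-- Assumption: state and zip3 are listed last, after name, city and county.
DESCRIBE testtable;

-- List the partitions that exist after the insert.
SHOW PARTITIONS testtable;
{code}
If the partition columns are indeed listed last, the fourth SELECT column (county) would line up with state, which would be consistent with the folder names described above.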

If the columns are reordered so that the columns on which the table is partitioned are last in the column list, the insert appears to work fine.
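
For reference, a sketch of the reordered insert, using the same hypothetical table and column names as the mockup above. The only change is that the partition columns (state, zip3) are moved to the end of the SELECT list, in the same order as in PARTITIONED BY.
{code:sql}
-- Same insert as before, with the partition columns listed last.
INSERT INTO testtable
SELECT
  x.name,
  x.city,
  x.county,
  x.state,
  substring(x.zip,1,3) as zip3
FROM originaltable x
WHERE state = 'AL'
CLUSTER BY state, zip3
;
{code}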

It would be great if this could be patched to work regardless of column order. At the very least, I believe Spark should warn or raise an error when the partition columns are not last in the column list, because the current behavior silently inserts incorrect data into the table.



