Posted to issues@spark.apache.org by "Steven Cardella (JIRA)" <ji...@apache.org> on 2018/10/18 22:04:00 UTC

[jira] [Created] (SPARK-25774) Eliminate query anomalies with empty partitions - TRUNCATE, SELECT DISTINCT, etc.

Steven Cardella created SPARK-25774:
---------------------------------------

             Summary: Eliminate query anomalies with empty partitions - TRUNCATE, SELECT DISTINCT, etc.
                 Key: SPARK-25774
                 URL: https://issues.apache.org/jira/browse/SPARK-25774
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
         Environment: Currently Cloudera with Spark 2.2.0, but I understand the behavior is widespread.
            Reporter: Steven Cardella


If you run a Spark SQL TRUNCATE TABLE command on a managed Hive table, it deletes the data files in HDFS but leaves the partitions and the partition folder structure in place. If you then run SELECT DISTINCT on the partition columns, it returns the values of all the now-empty partitions. So SELECT DISTINCT can return rows while SELECT * on the same table returns 0 rows.
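
To make the mismatch concrete, here is a toy model in plain Python. It is only a sketch of the behavior described above, not Spark's implementation: the class and method names are hypothetical, and it assumes the DISTINCT on partition columns is answered from the partition list rather than from the data.

```python
# Toy model of a partitioned table: partition entries and the rows stored
# under them are tracked separately. All names here are hypothetical.

class PartitionedTable:
    def __init__(self):
        self.partitions = {}  # partition value -> list of rows stored under it

    def insert(self, part_value, row):
        self.partitions.setdefault(part_value, []).append(row)

    def truncate(self):
        # Mirrors the reported behavior: the data goes, the partition entries stay.
        for part_value in self.partitions:
            self.partitions[part_value] = []

    def select_all(self):
        return [(p, r) for p, rows in self.partitions.items() for r in rows]

    def distinct_partitions_metadata_only(self):
        # Answered from the partition list alone, so empty partitions show up.
        return sorted(self.partitions)

    def distinct_partitions_from_rows(self):
        # Answered from the rows themselves, which is what this report asks for.
        return sorted({p for p, _ in self.select_all()})


table = PartitionedTable()
table.insert("2018-10-17", "a")
table.insert("2018-10-18", "b")
table.truncate()

print(table.select_all())                         # []
print(table.distinct_partitions_metadata_only())  # ['2018-10-17', '2018-10-18']
print(table.distinct_partitions_from_rows())      # []
```

If the cause in Spark is the metadata-only query optimization (an assumption, not something confirmed in this report), setting spark.sql.optimizer.metadataOnly to false may be worth trying as a workaround.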

Coming from SQL Server and the like, I expect SELECT DISTINCT to always reflect the rows, and Impala works that way as well.

I'd like three things: SELECT DISTINCT to reflect rows, not partitions; TRUNCATE TABLE to have an option to drop partitions; and MSCK REPAIR TABLE to have an option to drop empty partitions.
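
Until something like that exists, one possible manual cleanup is to find the empty partition folders and drop them explicitly. The sketch below is only an illustration under several assumptions: the table uses Hive-style `column=value` directories, the directory tree is reachable through the local filesystem (HDFS would need a different client), and files starting with `_` or `.` are markers rather than data. The function names and `my_table` are hypothetical, and partition values are not escaped.

```python
import os
import tempfile


def find_empty_partitions(table_root):
    """Return partition specs, as lists of (column, value), with no data files."""
    empty = []
    for dirpath, dirnames, filenames in os.walk(table_root):
        rel = os.path.relpath(dirpath, table_root)
        if rel == ".":
            continue  # the table root itself is not a partition
        parts = rel.split(os.sep)
        if not all("=" in p for p in parts):
            continue  # not a Hive-style partition path
        has_child_partitions = any("=" in d for d in dirnames)
        # Assumption: names starting with "_" or "." are markers, not data.
        data_files = [f for f in filenames if not f.startswith((".", "_"))]
        if not has_child_partitions and not data_files:
            empty.append([tuple(p.split("=", 1)) for p in parts])
    return sorted(empty)


def drop_statement(table_name, spec):
    """Build an ALTER TABLE ... DROP PARTITION statement (values not escaped)."""
    cols = ", ".join(f"{col}='{val}'" for col, val in spec)
    return f"ALTER TABLE {table_name} DROP IF EXISTS PARTITION ({cols})"


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "dt=2018-10-17"))  # emptied partition
    os.makedirs(os.path.join(root, "dt=2018-10-18"))  # still has data
    with open(os.path.join(root, "dt=2018-10-18", "part-00000"), "w") as f:
        f.write("row\n")
    statements = [drop_statement("my_table", s) for s in find_empty_partitions(root)]

print(statements)
```

The generated statements are meant to be reviewed before running them; the script itself does not touch the metastore.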



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
