Posted to issues@hive.apache.org by "Jeffrey(Xilang) Yan (Jira)" <ji...@apache.org> on 2020/05/22 07:58:00 UTC

[jira] [Comment Edited] (HIVE-22077) Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata

    [ https://issues.apache.org/jira/browse/HIVE-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113829#comment-17113829 ] 

Jeffrey(Xilang) Yan edited comment on HIVE-22077 at 5/22/20, 7:57 AM:
----------------------------------------------------------------------

We hit exactly the same issue in production. An INSERT OVERWRITE statement failed because of a Hive metastore lock, and retrying the statement did not remove the old data, which left a large amount of duplicate data in HDFS. It is a nightmare now: we have to find every partition that contains duplicate data.
Could someone help review this patch?

[~kgyrtkirk] [~jcamachorodriguez] [~mgergely] [~ashutoshc]


was (Author: xilangyan):
We hit exactly the same issue in production. An INSERT OVERWRITE statement failed because of a Hive metastore lock, and retrying the statement did not remove the old data, which left a large amount of duplicate data in HDFS. It is a nightmare now: we have to find every partition that contains duplicate data.
Could someone help review this patch?

[~kgyrtkirk] [~jcamachorodriguez]

> Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-22077
>                 URL: https://issues.apache.org/jira/browse/HIVE-22077
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.1.1, 4.0.0, 2.3.4
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Major
>         Attachments: HIVE-22077.patch.1
>
>
> INSERT OVERWRITE into a static partition may not clean the corresponding HDFS location if the partition's info is not stored in the metadata.
> Steps to reproduce this issue:
> ------------------------------------------------
> 1. Create a managed table :
> ------------------------------------------------
> {code:sql}
>  CREATE TABLE `test`(                               
>    `id` string)                                     
>  PARTITIONED BY (                                   
>    `dayno` string)                                  
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
>  LOCATION                                           
>    'hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test' 
>  TBLPROPERTIES (                                    
>    'transient_lastDdlTime'='1564731656')   
> {code}
> ------------------------------------------------
> 2. Create the partition's directory and put some data in it
> ------------------------------------------------
> {code:bash}
> hdfs dfs -mkdir hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test/dayno=20190802
> hdfs dfs -put test.data hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test/dayno=20190802
> {code}
> ------------------------------------------------
> 3. Insert overwrite partition dayno=20190802
> ------------------------------------------------
> {code:sql}
> INSERT OVERWRITE TABLE test PARTITION(dayno='20190802')
> SELECT "some value";
> {code}
> ------------------------------------------------
> 4. The test.data file under the partition directory is not deleted.
> ------------------------------------------------
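
The precondition described above, a partition directory that exists on HDFS without a matching entry in the metadata, could be checked for by comparing the two listings. The following is only a minimal sketch under stated assumptions: the helper name unregistered_partitions and the two input files are hypothetical, the table is assumed to have a single partition level (dayno=...), and the commented commands for producing the inputs assume the table `test` is reachable from the current database.

```shell
#!/bin/sh
# Hypothetical helper: print partition names that exist as directories on HDFS
# but are missing from the metastore listing.
#   $1: file with one partition directory name per line (e.g. dayno=20190802)
#   $2: file with one registered partition per line (SHOW PARTITIONS output)
unregistered_partitions() {
  hdfs_sorted=$(mktemp) && hms_sorted=$(mktemp) || return 1
  sort -u "$1" > "$hdfs_sorted"
  sort -u "$2" > "$hms_sorted"
  # comm -23 keeps only the lines unique to the first (HDFS) input.
  comm -23 "$hdfs_sorted" "$hms_sorted"
  rm -f "$hdfs_sorted" "$hms_sorted"
}

# Assumed way to produce the inputs for the table in this issue:
#   hdfs dfs -ls hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test \
#     | awk '{print $NF}' | grep '=' | sed 's|.*/||' > hdfs_parts.txt
#   hive -e 'SHOW PARTITIONS test' > hms_parts.txt
#   unregistered_partitions hdfs_parts.txt hms_parts.txt
```

Any name printed by the helper is a directory that an INSERT OVERWRITE into that static partition may leave uncleaned, so it is a candidate for manual inspection before the statement is run or retried.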


