Posted to issues@hive.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/03 19:05:01 UTC

[jira] [Commented] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it

    [ https://issues.apache.org/jira/browse/HIVE-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16190161#comment-16190161 ] 

ASF GitHub Bot commented on HIVE-17608:
---------------------------------------

GitHub user sankarh opened a pull request:

    https://github.com/apache/hive/pull/255

    HIVE-17608: REPL LOAD should overwrite the data files if exists instead of duplicating it

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sankarh/hive HIVE-17608

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/255.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #255
    
----
commit f7731b20bee79adbc9de831e673d9361144e6500
Author: Sankar Hariappan <ma...@gmail.com>
Date:   2017-09-28T16:38:44Z

    HIVE-17608: REPL LOAD should overwrite the data files if exists instead of duplicating it

----


> REPL LOAD should overwrite the data files if exists instead of duplicating it
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-17608
>                 URL: https://issues.apache.org/jira/browse/HIVE-17608
>             Project: Hive
>          Issue Type: Sub-task
>          Components: HiveServer2, repl
>    Affects Versions: 3.0.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, pull-request-available, replication
>             Fix For: 3.0.0
>
>         Attachments: HIVE-17608.01.patch
>
>
> This change makes insert events idempotent.
> Currently, MoveTask creates a new file under a different name if the destination folder already contains a file of the same name. This is wrong when the same file appears in both the bootstrap dump and the incremental dump: by design, a duplicate file in the incremental dump should be ignored for idempotency, but instead we eventually end up with duplicate files. It is also wrong to simply retain the file name used in the staging folder. Suppose we get the same insert event twice: the first time the file comes from the source table folder, and the second time it comes from cm. We still end up with a duplicate copy. The right solution is to keep the same file name as in the source table folder.
> To do that, we can put the original file name in MoveWork. In MoveTask, if the original file name is set, we don't generate a new name and simply overwrite the existing file. This needs to be done in both bootstrap and incremental load.
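
The behaviour described in the issue can be sketched roughly as follows. This is an illustrative stand-alone example on the local filesystem, not the code from the patch: the class MoveSketch, the method moveIntoTable, the originalFileName parameter, and the "_copy_N" suffix used for the non-replication path are all assumptions made for the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical sketch; names do not correspond to Hive's actual MoveTask/MoveWork API.
final class MoveSketch {

    /**
     * Moves stagedFile into destDir and returns the final location.
     * originalFileName stands in for the value the issue proposes to carry
     * in MoveWork; null means "not a replication load".
     */
    static Path moveIntoTable(Path stagedFile, Path destDir, String originalFileName)
            throws IOException {
        Files.createDirectories(destDir);
        if (originalFileName != null) {
            // Replication load: reuse the source table's file name and overwrite,
            // so replaying the same event cannot produce a second copy.
            Path dest = destDir.resolve(originalFileName);
            return Files.move(stagedFile, dest, StandardCopyOption.REPLACE_EXISTING);
        }
        // Regular load: never clobber existing data, pick an unused name instead.
        String name = stagedFile.getFileName().toString();
        Path dest = destDir.resolve(name);
        for (int i = 1; Files.exists(dest); i++) {
            dest = destDir.resolve(name + "_copy_" + i); // assumed naming scheme
        }
        return Files.move(stagedFile, dest);
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        Path table = Files.createTempDirectory("table");
        // The same insert event replayed twice, staged under different names.
        for (String stagedName : new String[] {"tmp_a", "tmp_b"}) {
            Path staged = Files.write(staging.resolve(stagedName), "row1\n".getBytes());
            moveIntoTable(staged, table, "000000_0");
        }
        try (Stream<Path> files = Files.list(table)) {
            System.out.println(files.count()); // prints 1
        }
    }
}
```

Keying the destination on the original file name, rather than on the staging name, is what makes the replay harmless in this sketch: both deliveries of the event resolve to the same destination path, whichever location the file was fetched from.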



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)