Posted to issues@hive.apache.org by "Harish JP (Jira)" <ji...@apache.org> on 2021/05/19 09:44:00 UTC

[jira] [Updated] (HIVE-24936) Fix file name parsing and copy file move.

     [ https://issues.apache.org/jira/browse/HIVE-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harish JP updated HIVE-24936:
-----------------------------
    Description: 
The taskId and taskAttemptId are not extracted correctly for copy files (e.g. 00001_02_copy_3), and when an incompatible copy file is moved, the rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be 00001_02_copy_N.
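
A rough sketch of the intended parsing and renaming behaviour, for illustration only. The class name, the regex and the helper methods are hypothetical and are not taken from the Hive code base. It assumes names of the form <taskId>_<attemptId> with an optional _copy_<N> suffix, as in the examples above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CopyFileNameSketch {
    // Assumed layout: <taskId>_<attemptId>[_copy_<N>]; not Hive's actual pattern.
    private static final Pattern NAME =
        Pattern.compile("^(\\d+)_(\\d+)(?:_copy_(\\d+))?$");

    /** Returns {taskId, attemptId, copyIndex}; copyIndex is 0 when there is no _copy_N suffix. */
    static int[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized file name: " + fileName);
        }
        int copy = m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), copy};
    }

    /** Picks the next free <taskId>_<attemptId>_copy_N name instead of appending _1 to the whole name. */
    static String nextFreeName(String fileName, Set<String> existing) {
        if (!existing.contains(fileName)) {
            return fileName;
        }
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized file name: " + fileName);
        }
        // Reuse the matched digits as-is so the zero padding is preserved.
        String base = m.group(1) + "_" + m.group(2);
        int n = m.group(3) == null ? 1 : Integer.parseInt(m.group(3)) + 1;
        while (existing.contains(base + "_copy_" + n)) {
            n++;
        }
        return base + "_copy_" + n;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("00001_02_copy_3"));
        // Prints 00001_02_copy_4 rather than 00001_02_copy_3_1.
        System.out.println(nextFreeName("00001_02_copy_3", existing));
    }
}
```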

 

Incompatible files should always be renamed using the current task; otherwise the moved file can be deleted if its name conflicts with another task's output file. Ex: if the input file for a task is named 00005_01 and is incompatible, moving it unchanged means it is treated as an output file of task id 5, attempt 1. If that task attempt exists, it will try to generate the same file and fail, and another attempt will be made. There will then be two files, 00005_01 and 00005_02, and the deduping code will remove 00005_01, resulting in data loss. There are other scenarios where the same can happen.
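
One way to picture the proposed rule, as a minimal sketch rather than the actual patch: the destination name is built from the current task's id and attempt, never from the ids embedded in the input file name. The class and method names are hypothetical, and the padding widths (5 and 2 digits) are assumptions inferred from the examples above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class IncompatibleFileRenameSketch {
    /**
     * Builds a destination name from the CURRENT task's ids, ignoring whatever
     * ids appear in the incompatible input file's name. Falls back to a
     * _copy_N suffix if that name is already taken.
     */
    static String destinationName(int currentTaskId, int currentAttemptId, Set<String> existing) {
        // Assumed padding: 5 digits for the task id, 2 for the attempt id.
        String base = String.format("%05d_%02d", currentTaskId, currentAttemptId);
        if (!existing.contains(base)) {
            return base;
        }
        int n = 1;
        while (existing.contains(base + "_copy_" + n)) {
            n++;
        }
        return base + "_copy_" + n;
    }

    public static void main(String[] args) {
        // The incompatible input file 00005_01 is handled by task 7, attempt 0.
        Set<String> existing = new HashSet<>(Arrays.asList("00005_01"));
        // Prints 00007_00, so the moved file cannot be mistaken for task 5's output.
        System.out.println(destinationName(7, 0, existing));
    }
}
```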

  was:The taskId and taskAttemptId is not extracted correctly for copy files (00001_02_copy_3) and when doing a move file of an incompatible copy file the rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 00001_02_copy_3_1 if 00001_02_copy_3 already exists, ideally it should be 00001_02_copy_N.


> Fix file name parsing and copy file move.
> -----------------------------------------
>
>                 Key: HIVE-24936
>                 URL: https://issues.apache.org/jira/browse/HIVE-24936
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Harish JP
>            Assignee: Harish JP
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The taskId and taskAttemptId are not extracted correctly for copy files (e.g. 00001_02_copy_3), and when an incompatible copy file is moved, the rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be 00001_02_copy_N.
>  
> Incompatible files should always be renamed using the current task; otherwise the moved file can be deleted if its name conflicts with another task's output file. Ex: if the input file for a task is named 00005_01 and is incompatible, moving it unchanged means it is treated as an output file of task id 5, attempt 1. If that task attempt exists, it will try to generate the same file and fail, and another attempt will be made. There will then be two files, 00005_01 and 00005_02, and the deduping code will remove 00005_01, resulting in data loss. There are other scenarios where the same can happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)