Posted to issues@hive.apache.org by "Harish JP (Jira)" <ji...@apache.org> on 2021/05/19 09:44:00 UTC
[jira] [Updated] (HIVE-24936) Fix file name parsing and copy file move.
[ https://issues.apache.org/jira/browse/HIVE-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harish JP updated HIVE-24936:
-----------------------------
Description:
The taskId and taskAttemptId are not extracted correctly for copy files (e.g. 00001_02_copy_3), and when an incompatible copy file is moved, the rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be 00001_02_copy_N.
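As an illustration of the intended behaviour, here is a minimal Python sketch (not the Hive implementation). It assumes file names have exactly the shape <taskId>_<attemptId>[_copy_<N>] shown in the examples above; real Hive file names may carry prefixes or extensions that this ignores. The names parse_file_name and next_copy_name are hypothetical.

```python
import re

# Assumed layout: <taskId>_<attemptId>[_copy_<N>], e.g. 00001_02_copy_3.
FILE_NAME_RE = re.compile(r"(?P<task>\d+)_(?P<attempt>\d+)(?:_copy_(?P<copy>\d+))?")


def parse_file_name(name):
    """Return (task_id, attempt_id, copy_index); copy_index is None for non-copy files."""
    m = FILE_NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    copy = m.group("copy")
    return m.group("task"), m.group("attempt"), int(copy) if copy is not None else None


def next_copy_name(name, existing):
    """Return name if it is free, else the first free <task>_<attempt>_copy_<N>."""
    if name not in existing:
        return name
    task, attempt, copy = parse_file_name(name)
    # Continue counting from the existing copy index instead of appending "_1".
    n = (copy or 0) + 1
    while f"{task}_{attempt}_copy_{n}" in existing:
        n += 1
    return f"{task}_{attempt}_copy_{n}"
```

With this sketch, a conflicting 00001_02_copy_3 becomes 00001_02_copy_4 rather than 00001_02_copy_3_1, because the copy index is parsed out of the name before a new one is generated.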
Incompatible files should always be renamed using the current task; otherwise a moved file can be deleted if its name conflicts with another task's output file. Ex: if the input file for a task is named 00005_01 and is incompatible, then after the move it is treated as an output file of task id 5, attempt 1. If task 5, attempt 1 exists, it will try to generate the same file and fail, and another attempt will be made. There will then be two files, 00005_01 and 00005_02, and the deduping code will remove 00005_01, resulting in data loss. The same can happen in other scenarios.
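The data-loss scenario can be reproduced with a toy model in Python. This is only a sketch: it assumes the dedup step keeps, for each task id, the file with the highest attempt id, which is a simplification of whatever Hive actually does. The function name, the file contents, and the "current task" name 00002_00 are hypothetical.

```python
def dedup_keep_latest_attempt(files):
    """Toy dedup: per task id, keep only the file with the highest attempt id."""
    latest = {}
    for name in files:
        task, attempt = name.split("_")[:2]
        kept = latest.get(task)
        if kept is None or int(attempt) > int(kept.split("_")[1]):
            latest[task] = name
    return {name: files[name] for name in latest.values()}


# The moved incompatible input kept its name 00005_01, so task 5 had to
# write its real output as attempt 2.
output_dir = {
    "00005_01": "moved incompatible input",
    "00005_02": "task 5 output",
}
survivors = dedup_keep_latest_attempt(output_dir)

# Same move, but the file is renamed after the (hypothetical) current task.
renamed_dir = {
    "00002_00": "moved incompatible input",
    "00005_02": "task 5 output",
}
renamed_survivors = dedup_keep_latest_attempt(renamed_dir)
```

In the first case only 00005_02 survives and the moved input is lost. In the second case both files survive, because the moved file no longer looks like an earlier attempt of task 5.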
was:The taskId and taskAttemptId is not extracted correctly for copy files (00001_02_copy_3) and when doing a move file of an incompatible copy file the rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 00001_02_copy_3_1 if 00001_02_copy_3 already exists, ideally it should be 00001_02_copy_N.
> Fix file name parsing and copy file move.
> -----------------------------------------
>
> Key: HIVE-24936
> URL: https://issues.apache.org/jira/browse/HIVE-24936
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Reporter: Harish JP
> Assignee: Harish JP
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>