Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/03/10 09:46:00 UTC
[jira] [Work logged] (HIVE-24866) FileNotFoundException during alter table concat

     [ https://issues.apache.org/jira/browse/HIVE-24866?focusedWorklogId=563599&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-563599 ]

ASF GitHub Bot logged work on HIVE-24866:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Mar/21 09:45
            Start Date: 10/Mar/21 09:45
    Worklog Time Spent: 10m 
      Work Description: prasanthj opened a new pull request #2057:
URL: https://github.com/apache/hive/pull/2057


   ### What changes were proposed in this pull request?
   There has been a bug lurking in alter table concatenate for ORC, typically observed when the ORC files are large and spread across different nodes and racks. CombineFileInputFormat groups files together based on node/rack locality and on the default max split size of 256MB, so if an ORC file is larger than 256MB and spans multiple nodes/racks, CombineFileInputFormat splits the file and groups the pieces into different splits. When these splits are processed by the mappers of the merge task, the first task initiates the concatenation and, as part of task commit, moves the file to the scratch dir. When the same file is then processed by a mapper holding a different split, the file no longer exists because it was moved by the prior mapper. This can cause failures in the alter table concat task and can also result in stripes being lost because of the partial concatenation.
   This PR addresses the issue by having the mapper that gets the start of the split own the entire ORC file for concatenation. It processes all the stripes, concatenates them to the destination file, and moves the source file. Mappers that do not get the start of the split simply skip the file, as it has already been handled or will be handled by a different mapper.
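
The ownership rule described above can be illustrated with a toy model. This is only a sketch and not the code in the pull request: `Chunk`, `run_merge`, the dict standing in for the filesystem, and the `scratch/` prefix are all hypothetical names, mappers are run one after another rather than concurrently, and "start of the split" is assumed to mean the piece that begins at byte offset 0 of the file.

```python
from typing import Dict, List, NamedTuple, Tuple

MB = 1024 * 1024


class Chunk(NamedTuple):
    """One piece of a file as handed to a mapper (hypothetical model type)."""
    path: str
    start: int   # byte offset of this piece within the file
    length: int  # number of bytes in this piece


def run_merge(
    splits: List[List[Chunk]],
    fs: Dict[str, int],
    owner_rule: bool,
) -> Tuple[List[str], Dict[str, int]]:
    """Run one mapper per combined split, in order, against a toy filesystem.

    fs maps path -> file size in bytes. Returns (failures, merged), where
    failures lists the paths a mapper could not find and merged maps each
    source path to the number of bytes that were concatenated.
    """
    failures: List[str] = []
    merged: Dict[str, int] = {}
    for split in splits:
        for chunk in split:
            if owner_rule and chunk.start != 0:
                # Assumed rule: only the piece at offset 0 owns the file.
                continue
            if chunk.path not in fs:
                # The source was already moved by an earlier mapper.
                failures.append(chunk.path)
                continue
            size = fs[chunk.path]
            # The owner merges the whole file; otherwise only its own piece.
            merged[chunk.path] = size if owner_rule else chunk.length
            # Task commit moves the source file out of the way.
            fs["scratch/" + chunk.path] = fs.pop(chunk.path)
    return failures, merged


if __name__ == "__main__":
    # A 600MB file cut into pieces of at most 256MB, one piece per split.
    pieces = [
        Chunk("big.orc", 0, 256 * MB),
        Chunk("big.orc", 256 * MB, 256 * MB),
        Chunk("big.orc", 512 * MB, 88 * MB),
    ]
    splits = [[p] for p in pieces]
    for rule in (False, True):
        failures, merged = run_merge(splits, {"big.orc": 600 * MB}, rule)
        print(rule, failures, merged)
```

Without the rule, the model reproduces both symptoms from the description: later mappers fail to find the file, and only part of it is merged. With the rule, the order in which the splits are processed does not matter, since every non-owning mapper skips the file.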
   
   ### Why are the changes needed?
   To avoid concatenation failures and stripe loss issues. 
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Tested in an internal repro cluster that had larger ORC files spanning multiple nodes and racks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 563599)
    Remaining Estimate: 0h
            Time Spent: 10m

> FileNotFoundException during alter table concat
> -----------------------------------------------
>
>                 Key: HIVE-24866
>                 URL: https://issues.apache.org/jira/browse/HIVE-24866
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.4.0, 3.2.0, 4.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Because of the way CombineFileInputFormat groups files based on node and rack locality, there are cases where a single big ORC file gets spread across 2 or more combine hive splits. When the first task completes, as part of jobCloseOp the source ORC file of the concatenation is moved/renamed, which can lead to FileNotFoundException in subsequent mappers that have a partial split of that file.
> A simple fix would be for the mapper with the start of the split to own the entire ORC file for concatenation. If a mapper gets a partial split that is not the start, it can skip the entire file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)