Posted to issues@hive.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2019/11/27 11:17:00 UTC

[jira] [Commented] (HIVE-22548) Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location

    [ https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983416#comment-16983416 ] 

Steve Loughran commented on HIVE-22548:
---------------------------------------

Also, at L1644 it calls path.exists() before the listFiles call. Has anyone noticed that it is marked as deprecated? There's a reason we warn people about it, and it's this recurrent exists() + operation code path, which duplicates the expensive check for whether a file or directory exists.

*just call listStatus and treat a FileNotFoundException as a sign that the path doesn't exist*

That is exactly what exists() does internally, after all.
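A minimal sketch of the suggested pattern. This uses java.nio for illustration (so it runs standalone); in Hive the analogous call would be FileSystem.listStatus() with FileNotFoundException caught the same way. The helper name listOrEmpty is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ListInsteadOfExists {
    // Hypothetical helper: list a directory, treating a missing path as
    // "no entries" rather than probing with exists() first. One round trip
    // to the store instead of two.
    static List<Path> listOrEmpty(Path dir) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            List<Path> out = new ArrayList<>();
            for (Path p : stream) {
                out.add(p);
            }
            return out;
        } catch (NoSuchFileException absent) {
            // Same signal exists() would have given, without the extra call.
            return Collections.emptyList();
        }
    }
}
```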

While I'm looking at that class:
h3. removeEmptyDpDirectory

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1601]

This contains a needless listFiles call just to see whether the directory is empty.

If you use delete(path, false) (i.e. the non-recursive one), it does the check for children internally *and rejects the call*. Just swallow the exception it raises telling you off about this fact.
 * we have a test for this for every single file system; it is the same as "rm dir" on the command line. You do not need to worry about it being implemented wrong.
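Sketched below with java.nio so it is self-contained; with an HDFS-style FileSystem this would be fs.delete(path, false), catching the IOException raised for a non-empty directory. The helper name deleteIfEmpty is made up for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class DeleteIfEmpty {
    // Attempt a non-recursive delete and swallow the "directory not empty"
    // rejection, instead of listing the directory first to check.
    static boolean deleteIfEmpty(Path dir) throws IOException {
        try {
            Files.delete(dir);       // non-recursive: fails if children exist
            return true;             // directory was empty and is now gone
        } catch (DirectoryNotEmptyException hasChildren) {
            return false;            // children present: leave it alone
        } catch (NoSuchFileException absent) {
            return false;            // already gone: nothing to do
        }
    }
}
```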

h3. removeTempOrDuplicateFiles

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1757]

delete() returns false in only two conditions:
 # you've tried to delete root
 # the file wasn't actually there

You shouldn't need to check the return value, and if there is any chance that some other process could delete the temp file first, checking it would convert a harmless no-op into a failure.
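A small sketch of that point, using java.nio's deleteIfExists as a stand-in for FileSystem.delete(): issue the delete and treat "wasn't there" as success rather than an error.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BestEffortDelete {
    // Hypothetical helper: best-effort removal of a temp file.
    // deleteIfExists returns false when the file is already gone -- which
    // is exactly the outcome we wanted, so the return value is ignored.
    static void deleteTempFile(Path p) throws IOException {
        Files.deleteIfExists(p);
    }
}
```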
h3. getFileSizeRecursively()

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1840]
 getFileSizeRecursively() is potentially really expensive too.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1853]
 this swallows all exception details. Please include the message and the nested exception; everyone who fields support calls will appreciate it.
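For illustration, a sketch of wrapping a failure while keeping the details: pass the original message and the caught exception as the cause, so support logs show the full chain. The names sizeOrFail and probe are invented for the example.

```java
import java.io.IOException;

public class PreserveCause {
    // Wrap the failure but keep the original message and attach the caught
    // exception as the nested cause, instead of discarding both.
    static long sizeOrFail(String path) throws IOException {
        try {
            return probe(path);
        } catch (RuntimeException e) {
            throw new IOException(
                "Unable to get size of " + path + ": " + e.getMessage(), e);
        }
    }

    // Stand-in for the real storage call; always fails here so the
    // wrapping path is exercised.
    private static long probe(String path) {
        throw new RuntimeException("simulated storage failure");
    }
}
```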

> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-22548
>                 URL: https://issues.apache.org/jira/browse/HIVE-22548
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>    Affects Versions: 3.1.2
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> {{Utilities.removeTempOrDuplicateFiles}}
> is very slow with cloud storage, as it executes {{listStatus}} twice and also runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629



--
This message was sent by Atlassian Jira
(v8.3.4#803005)