Posted to issues@hive.apache.org by "Vihang Karajgaonkar (JIRA)" <ji...@apache.org> on 2018/12/18 02:21:00 UTC

[jira] [Commented] (HIVE-21040) msck does unnecessary file listing at last level of directory tree

    [ https://issues.apache.org/jira/browse/HIVE-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723595#comment-16723595 ] 

Vihang Karajgaonkar commented on HIVE-21040:
--------------------------------------------

I spent a lot of time trying to figure out a good way to test this. {{FileSystem}} actually provides APIs to get statistics, but for some reason I am not able to use them in the test framework to confirm that the number of {{listStatus}} calls is as expected. I will try to dig into it more. If anyone has more ideas, please let me know.

> msck does unnecessary file listing at last level of directory tree
> ------------------------------------------------------------------
>
>                 Key: HIVE-21040
>                 URL: https://issues.apache.org/jira/browse/HIVE-21040
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>         Attachments: HIVE-21040.01.patch
>
>
> Here is the code snippet that {{msck}} runs to list directories:
> {noformat}
>       final Path currentPath = pd.p;
>       final int currentDepth = pd.depth;
>       FileStatus[] fileStatuses = fs.listStatus(currentPath, FileUtils.HIDDEN_FILES_PATH_FILTER);
>       // found no files under a sub-directory under table base path; it is possible that the table
>       // is empty and hence there are no partition sub-directories created under base path
>       if (fileStatuses.length == 0 && currentDepth > 0 && currentDepth < partColNames.size()) {
>         // since maxDepth is not yet reached, we are missing partition
>         // columns in currentPath
>         logOrThrowExceptionWithMsg(
>             "MSCK is missing partition columns under " + currentPath.toString());
>       } else {
>         // found files under currentPath add them to the queue if it is a directory
>         for (FileStatus fileStatus : fileStatuses) {
>           if (!fileStatus.isDirectory() && currentDepth < partColNames.size()) {
>             // found a file at depth which is less than number of partition keys
>             logOrThrowExceptionWithMsg(
>                 "MSCK finds a file rather than a directory when it searches for "
>                     + fileStatus.getPath().toString());
>           } else if (fileStatus.isDirectory() && currentDepth < partColNames.size()) {
>             // found a sub-directory at a depth less than number of partition keys
>             // validate if the partition directory name matches with the corresponding
>             // partition colName at currentDepth
>             Path nextPath = fileStatus.getPath();
>             String[] parts = nextPath.getName().split("=");
>             if (parts.length != 2) {
>               logOrThrowExceptionWithMsg("Invalid partition name " + nextPath);
>             } else if (!parts[0].equalsIgnoreCase(partColNames.get(currentDepth))) {
>               logOrThrowExceptionWithMsg(
>                   "Unexpected partition key " + parts[0] + " found at " + nextPath);
>             } else {
>               // add sub-directory to the work queue if maxDepth is not yet reached
>               pendingPaths.add(new PathDepthInfo(nextPath, currentDepth + 1));
>             }
>           }
>         }
>         if (currentDepth == partColNames.size()) {
>           return currentPath;
>         }
>       }
> {noformat}
> You can see that when {{currentDepth}} is at {{maxDepth}} (that is, {{partColNames.size()}}), the code still does an unnecessary listing of the files in the partition directory. We can improve this by checking {{currentDepth}} and bailing out early, before {{listStatus}} is called.
> This can improve the performance of the {{msck}} command significantly, especially when there are a lot of files in each partition on remote filesystems like S3 or ADLS.
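
The early bail-out described above, together with a way to count listing calls, can be sketched roughly as follows. This is a minimal illustration and not the contents of HIVE-21040.01.patch: {{MsckListingSketch}}, {{CountingTree}}, {{PathDepth}}, {{findPartitions}} and {{sampleTree}} are hypothetical names, the in-memory tree stands in for a real {{FileSystem}}, and the partition-name validation from the snippet is left out for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

class MsckListingSketch {

    /** Hypothetical in-memory directory tree that counts every listing call. */
    static final class CountingTree {
        private final Map<String, List<String>> children = new TreeMap<>();
        int listCalls = 0;

        /** Registers a sub-directory called name under parent. */
        void addDir(String parent, String name) {
            String child = parent + "/" + name;
            children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
            children.computeIfAbsent(child, k -> new ArrayList<>());
        }

        /** Stand-in for fs.listStatus(path): one call is one listing. */
        List<String> list(String path) {
            listCalls++;
            return children.getOrDefault(path, new ArrayList<>());
        }
    }

    /** A path and its depth below the table base path. */
    static final class PathDepth {
        final String path;
        final int depth;

        PathDepth(String path, int depth) {
            this.path = path;
            this.depth = depth;
        }
    }

    /**
     * Breadth-first walk that collects directories at maxDepth.
     * With bailOutEarly set, a directory at maxDepth is accepted as a
     * partition without being listed; otherwise it is listed first,
     * mirroring the behaviour described in the issue.
     */
    static List<String> findPartitions(CountingTree fs, String basePath,
                                       int maxDepth, boolean bailOutEarly) {
        List<String> partitions = new ArrayList<>();
        Queue<PathDepth> pending = new ArrayDeque<>();
        pending.add(new PathDepth(basePath, 0));
        while (!pending.isEmpty()) {
            PathDepth pd = pending.poll();
            if (bailOutEarly && pd.depth == maxDepth) {
                partitions.add(pd.path); // no listing needed at the last level
                continue;
            }
            List<String> entries = fs.list(pd.path);
            if (pd.depth == maxDepth) {
                partitions.add(pd.path); // listing result is never used here
                continue;
            }
            for (String child : entries) {
                pending.add(new PathDepth(child, pd.depth + 1));
            }
        }
        return partitions;
    }

    /** Table at /t with two partition columns: 2 years, 2 months each. */
    static CountingTree sampleTree() {
        CountingTree fs = new CountingTree();
        for (String year : new String[] {"year=2017", "year=2018"}) {
            fs.addDir("/t", year);
            fs.addDir("/t/" + year, "month=1");
            fs.addDir("/t/" + year, "month=2");
        }
        return fs;
    }

    public static void main(String[] args) {
        CountingTree before = sampleTree();
        findPartitions(before, "/t", 2, false);
        CountingTree after = sampleTree();
        findPartitions(after, "/t", 2, true);
        System.out.println("listings without bail-out: " + before.listCalls);
        System.out.println("listings with bail-out: " + after.listCalls);
    }
}
```

On this sample tree both variants find the same four partitions, but the walk without the bail-out performs 7 listings (the base path, 2 year directories and 4 month directories) while the walk with it performs 3. A test against a real {{FileSystem}} could assert on a call count in the same way, assuming the listing calls can be intercepted or counted.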


