Posted to issues@hive.apache.org by "Bing Li (JIRA)" <ji...@apache.org> on 2017/07/12 16:17:00 UTC

[jira] [Assigned] (HIVE-16999) Performance bottleneck in the ADD FILE/ARCHIVE commands for an HDFS resource

     [ https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bing Li reassigned HIVE-16999:
------------------------------

    Assignee: Bing Li

> Performance bottleneck in the ADD FILE/ARCHIVE commands for an HDFS resource
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-16999
>                 URL: https://issues.apache.org/jira/browse/HIVE-16999
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sailee Jain
>            Assignee: Bing Li
>            Priority: Critical
>
> A performance bottleneck was found when adding a resource that resides on HDFS to the distributed cache.
> The commands used are:
> {code:java}
> 1. ADD ARCHIVE "hdfs://some_dir/archive.tar"
> 2. ADD FILE "hdfs://some_dir/file.txt"
> {code}
> Here is the log corresponding to the ADD ARCHIVE operation:
> {noformat}
>  converting to local hdfs://some_dir/archive.tar
>  Added resources: [hdfs://some_dir/archive.tar
> {noformat}
> Hive downloads the resource to the local filesystem (shown in the log by "converting to local").
> {color:#d04437}Ideally there is no need to bring the file to the local filesystem when this operation is simply copying the file from one HDFS location to another (the distributed cache).{color}
> This becomes a significant performance bottleneck when the resource is a large file and every command needs the same resource.
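> For illustration only, here is a minimal sketch (not Hive's actual code) of how one might detect that a resource URI already lives on the cluster's default filesystem, using Hadoop's FileSystem.getDefaultUri(); the helper name isOnDefaultFs is hypothetical:
> {code:java}
> import java.net.URI;
> import java.net.URISyntaxException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> // Hypothetical helper: returns true when the resource already lives on the
> // cluster's default filesystem (e.g. the same HDFS), so the local download
> // could be skipped and the resource placed in the distributed cache directly.
> static boolean isOnDefaultFs(String value, Configuration conf) throws URISyntaxException {
>   URI resourceUri = new URI(value);
>   URI defaultUri = FileSystem.getDefaultUri(conf);
>   return resourceUri.getScheme() != null
>       && resourceUri.getScheme().equalsIgnoreCase(defaultUri.getScheme())
>       && resourceUri.getAuthority() != null
>       && resourceUri.getAuthority().equalsIgnoreCase(defaultUri.getAuthority());
> }
> {code}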
> After debugging, the impacted piece of code was found to be:
> {code:java}
> public List<String> add_resources(ResourceType t, Collection<String> values, boolean convertToUnix)
>       throws RuntimeException {
>     Set<String> resourceSet = resourceMaps.getResourceSet(t);
>     Map<String, Set<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
>     Map<String, Set<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
>     List<String> localized = new ArrayList<String>();
>     try {
>       for (String value : values) {
>         String key;
>         // get the local path of the downloaded jars
>         List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
>         // ...
> {code}
> {code:java}
>   List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix) throws URISyntaxException,
>       IOException {
>     URI uri = createURI(value);
>     if (getURLType(value).equals("file")) {
>       return Arrays.asList(uri);
>     } else if (getURLType(value).equals("ivy")) {
>       return dependencyResolver.downloadDependencies(uri);
>     } else { // any other scheme, including HDFS, falls through to here
>       return Arrays.asList(createURI(downloadResource(value, convertToUnix))); // downloads the resource to the local machine, even when it already lives on HDFS
>     }
>   }
> {code}
> Here, resolveAndDownload() always calls downloadResource() when the resource is on an external filesystem. It should take into account that when the resource is already on the same HDFS, bringing it to the local machine is an unnecessary step that can be skipped for better performance.
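> A minimal sketch of how resolveAndDownload() could take this into account, assuming the hypothetical isOnDefaultFs() helper sketched above and access to the session's Hadoop Configuration (named conf here); this is an illustration, not the actual Hive fix:
> {code:java}
>   List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix) throws URISyntaxException,
>       IOException {
>     URI uri = createURI(value);
>     if (getURLType(value).equals("file")) {
>       return Arrays.asList(uri);
>     } else if (getURLType(value).equals("ivy")) {
>       return dependencyResolver.downloadDependencies(uri);
>     } else if (isOnDefaultFs(value, conf)) { // hypothetical check: already on the cluster's HDFS
>       return Arrays.asList(uri);             // skip the local download and use the HDFS URI directly
>     } else {
>       return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
>     }
>   }
> {code}
> Whether the rest of add_resources() can consume a non-local URI directly would still need to be verified, so the real fix may require additional handling downstream.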
> Thanks,
> Sailee



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)