Posted to common-dev@hadoop.apache.org by "Ahmed Hussein (Jira)" <ji...@apache.org> on 2020/11/06 20:13:00 UTC

[jira] [Created] (HADOOP-17362) Doing hadoop ls on Har file triggers too many RPC calls

Ahmed Hussein created HADOOP-17362:
--------------------------------------

             Summary: Doing hadoop ls on Har file triggers too many RPC calls
                 Key: HADOOP-17362
                 URL: https://issues.apache.org/jira/browse/HADOOP-17362
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs
            Reporter: Ahmed Hussein
            Assignee: Ahmed Hussein


[~daryn] noticed that invoking hadoop ls on a HAR takes too much time.

The har system has multiple deficiencies that significantly impact performance:

# Parsing the master index references ranges within the archive index. Each range requires re-opening the HDFS input stream and seeking back to the location where the previous read stopped.
# Listing a har stats the archive index for every "directory". The per-call cache uses a unique key for each stat, rendering the cache useless and significantly increasing memory pressure.
# Determining the children of a directory scans the entire archive contents and filters for that directory's children, even though the cached metadata already stores the exact child list.
# Globbing a har's contents results in unnecessary stats for every leaf path.
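
Items 2 and 3 above can be illustrated with a minimal Java sketch. It is not the Hadoop implementation: every name in it (HarListingSketch, Entry, IndexCache, parseIndex, the example archive URI) is hypothetical, and the "index" is a tiny in-memory map rather than a real archive index. It assumes the parsed index is cached under a stable key (the archive path) and that each directory entry records its own child names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Hadoop code base.
class HarListingSketch {

    // Simplified stand-in for one parsed archive-index entry.
    static final class Entry {
        final String path;
        final List<String> children; // child names; empty for files

        Entry(String path, List<String> children) {
            this.path = path;
            this.children = children;
        }
    }

    // Caches parsed indexes under a stable key (the archive path), so that
    // repeated stats of the same archive reuse one parsed index.
    static final class IndexCache {
        private final Map<String, Map<String, Entry>> byArchive = new HashMap<>();
        int loads = 0; // number of simulated index parses

        Map<String, Entry> get(String archivePath) {
            Map<String, Entry> index = byArchive.get(archivePath);
            if (index == null) {
                loads++;
                index = parseIndex();
                byArchive.put(archivePath, index);
            }
            return index;
        }
    }

    // Simulated parse of a tiny archive containing /, /a, /a/x, /a/y and /b.
    static Map<String, Entry> parseIndex() {
        Map<String, Entry> index = new HashMap<>();
        index.put("/", new Entry("/", List.of("a", "b")));
        index.put("/a", new Entry("/a", List.of("x", "y")));
        index.put("/a/x", new Entry("/a/x", List.of()));
        index.put("/a/y", new Entry("/a/y", List.of()));
        index.put("/b", new Entry("/b", List.of()));
        return index;
    }

    // Slow approach: visit every entry and keep those whose parent is dir.
    static List<String> childrenByScan(Map<String, Entry> index, String dir) {
        List<String> result = new ArrayList<>();
        for (String path : index.keySet()) {
            if (path.equals("/")) {
                continue; // the root has no parent
            }
            int slash = path.lastIndexOf('/');
            String parent = slash == 0 ? "/" : path.substring(0, slash);
            if (parent.equals(dir)) {
                result.add(path);
            }
        }
        Collections.sort(result);
        return result;
    }

    // Direct approach: read the child list already stored on the directory entry.
    static List<String> childrenByLookup(Map<String, Entry> index, String dir) {
        List<String> result = new ArrayList<>();
        Entry entry = index.get(dir);
        if (entry == null) {
            return result;
        }
        String prefix = dir.equals("/") ? "/" : dir + "/";
        for (String name : entry.children) {
            result.add(prefix + name);
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        IndexCache cache = new IndexCache();
        Map<String, Entry> index = cache.get("har:///example.har"); // made-up URI
        cache.get("har:///example.har"); // second call is served from the cache
        System.out.println(childrenByLookup(index, "/a"));
        System.out.println(childrenByScan(index, "/a"));
        System.out.println("index parses: " + cache.loads);
    }
}
```

Both listing methods return the same children for a directory, but the lookup touches only one entry while the scan visits every entry in the archive. With a stable cache key, the second request for the same archive does not parse the index again.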

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org