Posted to common-dev@hadoop.apache.org by "Ahmed Hussein (Jira)" <ji...@apache.org> on 2020/11/06 20:13:00 UTC
[jira] [Created] (HADOOP-17362) Doing hadoop ls on Har file triggers too many RPC calls
Ahmed Hussein created HADOOP-17362:
--------------------------------------
Summary: Doing hadoop ls on Har file triggers too many RPC calls
Key: HADOOP-17362
URL: https://issues.apache.org/jira/browse/HADOOP-17362
Project: Hadoop Common
Issue Type: Bug
Components: fs
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein
[~daryn] noticed that invoking hadoop ls on a HAR takes too much time.
The har file system has multiple deficiencies that significantly impact performance:
# Parsing the master index references ranges within the archive index. Each range requires re-opening the HDFS input stream and seeking to the location where the previous read stopped.
# Listing a har stats the archive index for every "directory". The per-call cache uses a unique key for each stat, which renders the cache useless and significantly increases memory pressure.
# Determining the children of a directory scans the entire archive contents and filters out the children, even though the cached metadata already stores the exact child list.
# Globbing a har's contents results in unnecessary stats for every leaf path.
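The second and third items can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop API and is not the actual har implementation: the class HarListingSketch, the Entry type, the loadIndex() helper, and the archive URI are all hypothetical names chosen for illustration. The sketch assumes a cache keyed by archive location, so that repeated stats share one parsed index, and contrasts a full scan for children with a direct read of the child list stored alongside each directory entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HarListingSketch {
    /** Hypothetical stand-in for one parsed archive index entry. */
    static final class Entry {
        final boolean isDir;
        final List<String> children; // child names stored with a directory entry

        Entry(boolean isDir, String... children) {
            this.isDir = isDir;
            this.children = Arrays.asList(children);
        }
    }

    /** Counts index loads; each load stands in for remote reads of the index. */
    static int indexLoads = 0;

    /** Hypothetical loader that returns a tiny hard-coded index. */
    static Map<String, Entry> loadIndex() {
        indexLoads++;
        Map<String, Entry> index = new HashMap<>();
        index.put("/", new Entry(true, "dir"));
        index.put("/dir", new Entry(true, "a.txt", "b.txt"));
        index.put("/dir/a.txt", new Entry(false));
        index.put("/dir/b.txt", new Entry(false));
        return index;
    }

    /** Cache keyed by archive location, so every stat on one archive hits the same key. */
    static final Map<String, Map<String, Entry>> CACHE = new HashMap<>();

    static Map<String, Entry> indexFor(String archiveUri) {
        return CACHE.computeIfAbsent(archiveUri, uri -> loadIndex());
    }

    /** Slow path: walks every entry in the archive and keeps those whose parent is dir. */
    static List<String> childrenByScan(Map<String, Entry> index, String dir) {
        List<String> names = new ArrayList<>();
        for (String path : index.keySet()) {
            if (path.equals("/")) continue; // the root has no parent
            int slash = path.lastIndexOf('/');
            String parent = (slash == 0) ? "/" : path.substring(0, slash);
            if (parent.equals(dir)) names.add(path.substring(slash + 1));
        }
        Collections.sort(names); // HashMap iteration order is unspecified
        return names;
    }

    /** Fast path: reads the child list already stored with the directory entry. */
    static List<String> childrenFromIndex(Map<String, Entry> index, String dir) {
        Entry entry = index.get(dir);
        return (entry == null || !entry.isDir) ? Collections.<String>emptyList() : entry.children;
    }

    public static void main(String[] args) {
        String uri = "har:///tmp/example.har"; // made-up archive location
        Map<String, Entry> index = indexFor(uri);
        indexFor(uri); // second lookup is served from the cache
        System.out.println("index loads: " + indexLoads);
        System.out.println("scan:   " + childrenByScan(index, "/dir"));
        System.out.println("lookup: " + childrenFromIndex(index, "/dir"));
    }
}
```

If the cache key were instead unique per call, every indexFor() invocation would miss and trigger a fresh load, which is the behavior the second item describes.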
--
This message was sent by Atlassian Jira
(v8.3.4#803005)