Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2021/03/16 11:18:00 UTC

[jira] [Commented] (HADOOP-17428) ABFS: Implementation for getContentSummary

    [ https://issues.apache.org/jira/browse/HADOOP-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302431#comment-17302431 ] 

Steve Loughran commented on HADOOP-17428:
-----------------------------------------

It turns out that Hive does use this for some of its size calculations for unmanaged tables, so its performance *does* matter, at least until/unless we can move Hive off it. Apparently Spark uses it too.

Personally, I don't think they should be using it, as it does an expensive treewalk, but they probably aren't aware of its cost.

Could abfs do some of the treewalk in parallel? For s3a, I think I'd go to the deep listing (listFiles(recursive=true)) and make up some directory count number.
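
To make the "deep listing plus made-up directory count" idea concrete, here is a rough sketch in plain Java. It is not the s3a or abfs code: FileEntry and Summary are hypothetical stand-ins for the entries a recursive listing returns and for ContentSummary, and counting the root directory itself is an assumption about the desired semantics. The directory count is inferred from the parent paths of the listed files, so empty directories are never seen; that is why the number is only an estimate.

```java
import java.util.HashSet;
import java.util.Set;

public class DeepListingSummary {

    /** Hypothetical stand-in for one file returned by a recursive listing. */
    public static final class FileEntry {
        final String path;
        final long length;
        public FileEntry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for the counts a content summary reports. */
    public static final class Summary {
        public final long fileCount;
        public final long directoryCount;
        public final long length;
        Summary(long fileCount, long directoryCount, long length) {
            this.fileCount = fileCount;
            this.directoryCount = directoryCount;
            this.length = length;
        }
    }

    /** Parent of a '/'-separated path; "" when there is none. */
    static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i < 0 ? "" : path.substring(0, i);
    }

    /**
     * Build a summary from a flat listing of files under root.
     * Assumes every entry's path is absolute and lies under root.
     */
    public static Summary summarize(String root, Iterable<FileEntry> files) {
        // Normalise the root so "/data/" and "/data" behave the same; "/" becomes "".
        String base = root;
        while (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        Set<String> dirs = new HashSet<>();
        dirs.add(base); // assumption: the root directory counts as one directory
        long fileCount = 0;
        long totalLength = 0;
        for (FileEntry f : files) {
            fileCount++;
            totalLength += f.length;
            // Walk up from the file's parent towards the root, stopping early
            // once a directory has been seen (its ancestors were added already).
            String dir = parent(f.path);
            while (dir.length() > base.length() && dirs.add(dir)) {
                dir = parent(dir);
            }
        }
        return new Summary(fileCount, dirs.size(), totalLength);
    }
}
```

Against a real filesystem the entries would come from the recursive listing rather than an in-memory list; the sketch only shows how one pass over that listing can produce all three numbers without a per-directory treewalk.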

> ABFS: Implementation for getContentSummary
> ------------------------------------------
>
>                 Key: HADOOP-17428
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17428
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.3.0
>            Reporter: Sumangala Patki
>            Assignee: Sumangala Patki
>            Priority: Major
>
> Adds implementation for HDFS method getContentSummary, which takes in a Path argument and returns details such as file/directory count and space utilized under that path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
