Posted to common-issues@hadoop.apache.org by "Ping Liu (JIRA)" <ji...@apache.org> on 2017/09/01 09:03:00 UTC

[jira] [Commented] (HADOOP-14600) LocatedFileStatus constructor forces RawLocalFS to exec a process to get the permissions

    [ https://issues.apache.org/jira/browse/HADOOP-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150245#comment-16150245 ] 

Ping Liu commented on HADOOP-14600:
-----------------------------------

I followed [~steve_l]'s idea and added a native stat() implementation.  It is similar to fstat(), but it takes a path rather than a file descriptor, so the file does not need to be opened first.

With this change, there is no longer any need to spawn an extra thread to gather process info.
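To illustrate the idea of reading the attributes by path and in-process, here is a minimal Java sketch. It is not the code in the patch: it uses java.nio's PosixFileAttributes as a stand-in for the native stat() call, the class and method names (StatSketch, describe) are hypothetical, and it assumes a POSIX file system (it would throw UnsupportedOperationException on Windows).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;

public class StatSketch {
    /** Returns "owner group rwxrwxrwx" for a path without forking a process. */
    static String describe(Path path) throws IOException {
        // A single in-process attribute read, addressed by path.
        // Stand-in for the native stat() call; not the Hadoop implementation.
        PosixFileAttributes attrs = Files.readAttributes(path, PosixFileAttributes.class);
        return attrs.owner().getName() + " " + attrs.group().getName() + " "
            + PosixFilePermissions.toString(attrs.permissions());
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stat-sketch", ".txt");
        try {
            System.out.println(describe(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The point of the sketch is only that owner, group, and permissions come back from one call on the path, with no child process involved.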

I ran some manual tests on both Windows 10 and Linux (Ubuntu on VirtualBox).  Both systems show a dramatic improvement: in the runs that completed both ways, native IO was roughly 10x faster on Windows and roughly 3x faster on Linux.

{noformat}
Windows
    number of files     time (ms) without native IO     time (ms) with native IO
    100                 14274                           1234
    150                 19002                           1782
    200                 21865                           2250
    500                 timed out                       5125
    1000                timed out                       9735
    2000                timed out                       18875

Linux
    number of files     time (ms) without native IO     time (ms) with native IO
    100                 4539                            1632
    150                 6137                            2031
    200                 7139                            2764
    500                 15566                           5292
    1000                timed out                       7490
    2000                timed out                       14040
{noformat}

The test is primitive, but it is sufficient to show the improvement.

Attached is the patch file: *HADOOP-14600__Patch__20170901.txt*.

For the test, I added testListStatusForPerformance() to TestRawLocalFileSystem.java.  It is also attached above.
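For readers who want to try a similar measurement, the following is a rough sketch of a timing loop of this kind. It is not the attached testListStatusForPerformance(): it uses plain java.nio instead of the Hadoop FileSystem API, all names (ListStatusTimingSketch, createFiles, listWithPermissions, deleteAll) are hypothetical, and it assumes a POSIX file system. The file counts match the ones in the table above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.Set;

public class ListStatusTimingSketch {
    /** Creates a temp directory holding fileCount empty files. */
    static Path createFiles(int fileCount) throws IOException {
        Path dir = Files.createTempDirectory("list-status-sketch");
        for (int i = 0; i < fileCount; i++) {
            Files.createFile(dir.resolve("file-" + i));
        }
        return dir;
    }

    /** Lists the directory, reads each entry's permissions, returns the entry count. */
    static int listWithPermissions(Path dir) throws IOException {
        int listed = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // In-process permission lookup; this is the per-file cost being timed.
                Set<PosixFilePermission> perms = Files.getPosixFilePermissions(entry);
                if (perms != null) {
                    listed++;
                }
            }
        }
        return listed;
    }

    /** Removes the files and then the directory itself. */
    static void deleteAll(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                Files.delete(entry);
            }
        }
        Files.delete(dir);
    }

    public static void main(String[] args) throws IOException {
        for (int fileCount : new int[] {100, 150, 200, 500, 1000, 2000}) {
            Path dir = createFiles(fileCount);
            try {
                long start = System.nanoTime();
                int listed = listWithPermissions(dir);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println(fileCount + " files: listed " + listed
                    + " in " + elapsedMs + " ms");
            } finally {
                deleteAll(dir);
            }
        }
    }
}
```

Timings from a sketch like this are not comparable to the numbers above, since it does not go through RawLocalFileSystem; it only shows the shape of the measurement.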


> LocatedFileStatus constructor forces RawLocalFS to exec a process to get the permissions
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-14600
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14600
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.3
>         Environment: file:// in a dir with many files
>            Reporter: Steve Loughran
>
> Reported in SPARK-21137. A {{FileSystem.listStatus}} call really crawls against the local FS, because the {{FileStatus.getPermission}} call forces {{DeprecatedRawLocalFileStatus}} to spawn a process to read the real UGI values.
> That is: what is a field lookup or even a no-op for every other FS is a process exec/spawn on the local FS, with all the costs that entails. This gets expensive if you have many files.
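As a rough illustration of the cost described in the issue, the sketch below forks one child process per file to read its permissions. The command (`ls -ld`) is an assumption chosen for illustration; the exact command Hadoop runs is not shown in this thread, and the class and method names are hypothetical. It assumes a Unix-like system with `ls` on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExecPerFileSketch {
    /** Forks "ls -ld <path>" (an assumed stand-in command) and returns its first output line. */
    static String permissionsViaExec(Path path) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("ls", "-ld", path.toString())
            .redirectErrorStream(true)
            .start();
        String firstLine;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = reader.readLine();
            // Drain any remaining output so the child can exit cleanly.
            while (reader.readLine() != null) {
                // nothing to do
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0 || firstLine == null) {
            throw new IOException("ls -ld failed for " + path + " (exit code " + exitCode + ")");
        }
        return firstLine;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path tmp = Files.createTempFile("exec-sketch", ".txt");
        try {
            // One fork/exec per file: cheap once, expensive across thousands of files.
            System.out.println(permissionsViaExec(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Calling a method like this once per entry in a large directory means one process launch per file, which is the overhead the issue is about.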


