Posted to common-issues@hadoop.apache.org by "Ping Liu (JIRA)" <ji...@apache.org> on 2017/09/01 09:03:00 UTC
[jira] [Commented] (HADOOP-14600) LocatedFileStatus constructor
forces RawLocalFS to exec a process to get the permissions
[ https://issues.apache.org/jira/browse/HADOOP-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150245#comment-16150245 ]
Ping Liu commented on HADOOP-14600:
-----------------------------------
I followed [~steve_l]'s idea and added a native stat() implementation. It is similar to fstat(), but it takes a path rather than a file descriptor, so the file does not have to be opened first.
With this in place, there is no longer any need to spawn an extra thread to gather process info.
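To illustrate the idea of a path-based, in-process lookup, here is a minimal sketch. It is not the JNI code from the patch: it uses the JDK's own java.nio.file API as a stand-in, and the class name InProcessStat and the method describe() are hypothetical names chosen for this example. It assumes a POSIX file system; on Windows the PosixFileAttributes view is not available and the call throws UnsupportedOperationException.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical illustration only; not part of Hadoop or of the attached patch.
public class InProcessStat {

    /**
     * Returns "owner:group:permissions" for the given path.
     * The attributes are read by path in a single bulk call, so the caller
     * never opens the file and no child process is started.
     */
    public static String describe(Path path) throws IOException {
        PosixFileAttributes attrs =
                Files.readAttributes(path, PosixFileAttributes.class);
        return attrs.owner().getName()
                + ":" + attrs.group().getName()
                + ":" + PosixFilePermissions.toString(attrs.permissions());
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stat-demo", ".txt");
        try {
            System.out.println(describe(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The point of the sketch is only the shape of the call: owner, group and permissions come back from one lookup keyed by path, which is the same property the native stat() gives RawLocalFileSystem.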
I ran some manual tests on both Windows 10 and Linux (Ubuntu on VirtualBox). Both systems show a dramatic improvement.
{noformat}
Windows
number of files    time (ms)    time (ms) with native IO
100                14274        1234
150                19002        1782
200                21865        2250
500                timed out    5125
1000               timed out    9735
2000               timed out    18875

Linux
number of files    time (ms)    time (ms) with native IO
100                4539         1632
150                6137         2031
200                7139         2764
500                15566        5292
1000               timed out    7490
2000               timed out    14040
{noformat}
The test is primitive, but it is enough to show the improvement.
Attached is the patch file: *HADOOP-14600__Patch__20170901.txt*.
For the test, I added testListStatusForPerformance() to TestRawLocalFileSystem.java, which is also attached above.
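For readers without the attachment, the general shape of such a timing run can be sketched as follows. This is not the attached testListStatusForPerformance(): it does not use Hadoop's FileSystem API at all, and the class ListStatusTiming and its methods are hypothetical names for this example. It only shows the pattern of creating N files, listing the directory while reading each entry's attributes, and timing the listing, using the same file counts as the tables above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical stand-in for a listStatus timing test; not the attached patch.
public class ListStatusTiming {

    /** Creates fileCount empty files in dir. */
    public static void populate(Path dir, int fileCount) throws IOException {
        for (int i = 0; i < fileCount; i++) {
            Files.createFile(dir.resolve("file-" + i));
        }
    }

    /** Lists dir, reads each entry's attributes, and returns the entry count. */
    public static int listAndStat(Path dir) throws IOException {
        int seen = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                Files.readAttributes(entry, BasicFileAttributes.class);
                seen++;
            }
        }
        return seen;
    }

    /** Deletes every entry in dir and then dir itself (dir must be flat). */
    public static void cleanup(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                Files.delete(entry);
            }
        }
        Files.delete(dir);
    }

    public static void main(String[] args) throws IOException {
        for (int count : new int[] {100, 150, 200, 500, 1000, 2000}) {
            Path dir = Files.createTempDirectory("liststatus-");
            try {
                populate(dir, count);
                long start = System.nanoTime();
                int seen = listAndStat(dir);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%d files: %d ms%n", seen, elapsedMs);
            } finally {
                cleanup(dir);
            }
        }
    }
}
```

Timings from this sketch are not comparable to the numbers above, since it measures the JDK's attribute lookup rather than RawLocalFileSystem with or without native IO.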
> LocatedFileStatus constructor forces RawLocalFS to exec a process to get the permissions
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-14600
> URL: https://issues.apache.org/jira/browse/HADOOP-14600
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.7.3
> Environment: file:// in a dir with many files
> Reporter: Steve Loughran
>
> Reported in SPARK-21137. A {{FileSystem.listStatus}} call really crawls against the local FS, because the {{FileStatus.getPermission}} call forces {{DeprecatedRawLocalFileStatus}} to spawn a process to read the real UGI values.
> That is: what is a field lookup or even a no-op on every other FS is a process exec/spawn on the local FS, with all the costs that entails. This gets expensive if you have many files.