Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/11 09:01:52 UTC

[GitHub] [spark] ScrapCodes opened a new pull request #24585: Performance issue while listing large number of files on an object store.

URL: https://github.com/apache/spark/pull/24585

## What changes were proposed in this pull request?

Currently, Spark uses `FileStatusCache` to cache file listings while scanning a filesystem path. When the file system is on remote storage such as an object store (Amazon S3 or IBM COS), this cache is especially important because it avoids repeatedly fetching the listing over the network.

`FileStatusCache` is backed by a Guava cache, which is configured with a reasonably high default maximum size. However, when a remote listing is large (more than 100K entries), the cache needs to be correspondingly large. The underlying Guava cache currently uses the default concurrency level of 4. As a result, a single entry must be smaller than `maxSizeOfCache/concurrencyLevel` [1]. Quite often, users keep everything under a single directory or path on an object store, so the entire `FileStatus` array of 100K+ entries is inserted into the cache as a single entry. The cache size needed to hold that one entry therefore grows by more than 4x.
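The arithmetic behind this limit can be sketched as follows. This is a simplified model in Python, not Guava's implementation: the function names and the byte sizes are hypothetical, and rounding the concurrency level up to a power of two is an assumption about how the number of segments is chosen. The strict bound follows the description above; exact boundary handling in Guava may differ.

```python
MB = 1024 * 1024


def segments_for(concurrency_level: int) -> int:
    """Assumed segment count: concurrency level rounded up to a power of two."""
    count = 1
    while count < concurrency_level:
        count *= 2
    return count


def segment_budget(max_cache_size: int, concurrency_level: int) -> int:
    """Approximate share of the total cache size available to one segment."""
    return max_cache_size // segments_for(concurrency_level)


def entry_fits(entry_size: int, max_cache_size: int, concurrency_level: int) -> bool:
    """An entry lives in exactly one segment, so it must fit that segment's share."""
    return entry_size < segment_budget(max_cache_size, concurrency_level)


if __name__ == "__main__":
    # Hypothetical numbers: a 250 MB cache and one 100 MB listing entry.
    cache_size = 250 * MB
    listing_entry = 100 * MB
    for level in (1, 4):
        fits = entry_fits(listing_entry, cache_size, level)
        print(f"concurrencyLevel={level}: entry cacheable = {fits}")
```

Under these assumed numbers, the single large entry is rejected at concurrency level 4 even though it is well below the total cache size, and it is accepted at concurrency level 1.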

Please refer to the Jira [issue](https://issues.apache.org/jira/browse/SPARK-27664) for a more detailed explanation.

This patch sets the default concurrency level of the underlying Guava cache to 1 and makes it configurable. In practice, this cache stores only a few, very large entries, so the performance penalty should be small, if any.

I am open to working on an alternative solution as well; please feel free to suggest one.

[1]. https://github.com/google/guava/issues/3462

## How was this patch tested?

Existing tests should pass.
Manually verified the expected behaviour against a path with a large listing (~200K entries).
