Posted to issues@spark.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2017/06/27 20:33:00 UTC

[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

    [ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065423#comment-16065423 ] 

Steve Loughran commented on SPARK-21137:
----------------------------------------

Looking at this.

Something is trying to get the permissions for every file, and each lookup is handled by an exec, with all the overhead that brings. Looking at the code, it happens in the constructor of {{LocatedFileStatus}}, which builds the new object from another {{FileStatus}}. Normally that is just a simple copy of a field (fast, efficient), but on RawLocalFileSystem it actually triggers an on-demand execution. This behaviour has been around for a long time (HADOOP-2288); it's surfacing here because you're working with the local FS. For all other filesystems it's a quick operation.
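
To make the pattern concrete, here is a minimal, self-contained sketch of it. None of these classes are the real Hadoop ones: {{SimpleStatus}}, {{LazyLocalStatus}} and {{LocatedStatus}} are hypothetical stand-ins, and the "exec" is simulated with a counter rather than a forked process. It only shows how a copy constructor that calls a getter can turn a cheap field copy into one expensive lookup per file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyPermissionDemo {
    // Counts simulated expensive lookups; stands in for forking a process per file.
    static final AtomicInteger EXPENSIVE_LOOKUPS = new AtomicInteger();

    /** Hypothetical status object whose fields are all known up front. */
    static class SimpleStatus {
        final String path;
        private final String permission;

        SimpleStatus(String path, String permission) {
            this.path = path;
            this.permission = permission;
        }

        String getPermission() {
            return permission;
        }
    }

    /** Hypothetical local-FS status that defers the permission lookup until first asked. */
    static class LazyLocalStatus extends SimpleStatus {
        private String loaded;

        LazyLocalStatus(String path) {
            super(path, null);
        }

        @Override
        String getPermission() {
            if (loaded == null) {
                // Assumption: this is where the real code would pay for an exec.
                EXPENSIVE_LOOKUPS.incrementAndGet();
                loaded = "rw-r--r--";
            }
            return loaded;
        }
    }

    /** Hypothetical "located" wrapper: copying calls the getter, which forces the lazy load. */
    static class LocatedStatus extends SimpleStatus {
        LocatedStatus(SimpleStatus other) {
            super(other.path, other.getPermission());
        }
    }

    /** Wraps the given number of lazy statuses and reports how many lookups that cost. */
    static int countLookupsAfterCopy(int files) {
        EXPENSIVE_LOOKUPS.set(0);
        List<LocatedStatus> located = new ArrayList<>();
        for (int i = 0; i < files; i++) {
            located.add(new LocatedStatus(new LazyLocalStatus("/data/file-" + i)));
        }
        return EXPENSIVE_LOOKUPS.get();
    }

    public static void main(String[] args) {
        System.out.println("lookups for 1000 files: " + countLookupsAfterCopy(1000));
    }
}
```

With a status type that already holds its permission, the same loop would cost no lookups at all, which is why this only shows up on the local FS.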

I think this is an issue. I don't think anybody expected it to be a problem, as it's just viewed as the marshalling of a {{LocatedFileStatus}}, which is what you get back from {{FileSystem.listLocatedStatus}}. Normally that's the higher-performing call, not just on object stores but because it scales better: it can incrementally send back data in batches, rather than needing to enumerate an entire directory of files (possibly in the millions) and then send them around as arrays of {{FileStatus}}. Here, it's clearly not.
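
The iterator-versus-array distinction can be sketched with plain JDK calls. This is only an analogy and does not use the Hadoop API: {{Files.newDirectoryStream}} stands in for an incremental listing, {{File.listFiles}} stands in for one that returns everything as an array, and the class and helper names are made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class IncrementalListingDemo {
    /** Counts entries by walking a lazily populated stream; no full array is built. */
    static long countIncrementally(Path dir) {
        long count = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path ignored : stream) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    /** Counts entries by materialising the whole listing as an array first. */
    static long countViaArray(Path dir) {
        File[] all = dir.toFile().listFiles();
        return all == null ? 0 : all.length;
    }

    /** Hypothetical helper: creates a temp directory holding n empty files. */
    static Path makeTempDirWithFiles(int n) {
        try {
            Path dir = Files.createTempDirectory("listing-demo");
            for (int i = 0; i < n; i++) {
                Files.createFile(dir.resolve("file-" + i + ".txt"));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = makeTempDirWithFiles(50);
        System.out.println("incremental: " + countIncrementally(dir));
        System.out.println("array:       " + countViaArray(dir));
    }
}
```

Both give the same answer; the difference is that the first never needs the whole directory in memory at once, which is what matters once a directory holds millions of entries.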

What to do? I think we could consider whether it'd be possible to add this to the Hadoop native libs and so make a fast API call. There's also the option of allowing permissions to be disabled entirely. That one appeals to me more from a Windows perspective, where you could get rid of the Hadoop native lib and still have (most) things work there... but as it's an incomplete "most", it's probably an optimistic goal.
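
As a rough illustration of what "a fast API call" instead of an exec could look like, the sketch below reads permissions in-process through the JDK's {{java.nio.file}} API. To be clear, this is not the Hadoop native lib route suggested above, just the same idea using a call that is known to exist in the JDK. The class name, the helper and the "unsupported" placeholder are assumptions made for the example; the placeholder branch loosely mirrors the "no permissions" case on platforms without POSIX attributes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class InProcessPermissionDemo {
    // Placeholder returned where POSIX permissions are not available (e.g. default Windows FS).
    static final String UNSUPPORTED = "unsupported";

    /** Reads the permission string for a path without forking a process. */
    static String readPermissions(Path path) {
        boolean posix = FileSystems.getDefault()
                .supportedFileAttributeViews()
                .contains("posix");
        if (!posix) {
            return UNSUPPORTED;
        }
        try {
            // In-process call; returns a nine-character string such as "rw-r--r--".
            return PosixFilePermissions.toString(Files.getPosixFilePermissions(path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical helper: creates an empty temp file to inspect. */
    static Path makeTempFile() {
        try {
            return Files.createTempFile("perm-demo", ".txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = makeTempFile();
        System.out.println(file + " -> " + readPermissions(file));
    }
}
```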



> Spark reads many small files slowly
> -----------------------------------
>
>                 Key: SPARK-21137
>                 URL: https://issues.apache.org/jira/browse/SPARK-21137
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: sam
>            Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish).
> It seems all the code in Spark that manages file listing is single threaded and not well optimised.  When I hand crank the code and don't use Spark, my job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can easily just clone, and follow the README to reproduce exactly!


