Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2017/07/03 11:54:00 UTC
[jira] [Assigned] (SPARK-21137) Spark reads many small files slowly off local filesystem
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-21137:
-----------------------------------
Assignee: Sean Owen
> Spark reads many small files slowly off local filesystem
> --------------------------------------------------------
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: sam
> Assignee: Sean Owen
> Priority: Minor
> Fix For: 2.3.0
>
>
> A very common use case in big data is reading a large number of small files. For example, the Enron email dataset has 1,227,645 small files.
> Reading this data with Spark runs into several problems. First, even when the data is small (each file is only about 1 KB), a job can take a very long time: I have a simple job that has been running for 3 hours and has not yet reached the point of starting any tasks, and I doubt it will ever finish.
> It seems that all the code in Spark that manages file listing is single-threaded and not well optimised. When I hand-crank the code and don't use Spark, my job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example.
> It takes 1 hour to output the line "1,227,645 input paths to process", then another hour to output the same line again. It then outputs a CSV of all the input paths, creating a text storm.
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> I've provided full reproduction steps, including code and cluster setup, at https://github.com/samthebest/scenron (scroll down to "Bug In Spark"). You can simply clone the repository and follow the README to reproduce the problem exactly.
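
For readers who want a concrete picture of the "hand-cranked" approach the description mentions, here is a minimal sketch in Python. It is not the reporter's code (that lives in the linked repository) and it does not use Spark. It assumes the files sit on a local filesystem and fit in memory. The function names (list_files, read_file, read_small_files) and the default worker count are made up for illustration. The sketch lists the directory tree once with os.scandir, then reads the files concurrently on a thread pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Recursively collect regular file paths under root (hypothetical helper)."""
    paths = []
    stack = [root]
    while stack:
        current = stack.pop()
        # os.scandir returns directory entries together with cached type
        # information, which avoids an extra stat call per file on most systems.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    paths.append(entry.path)
    return paths


def read_file(path):
    """Read one small file fully and return (path, contents as bytes)."""
    with open(path, "rb") as handle:
        return path, handle.read()


def read_small_files(root, max_workers=8):
    """List files once, then read them concurrently on a thread pool.

    max_workers=8 is an arbitrary illustrative default, not a tuned value.
    """
    paths = list_files(root)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths.
        return dict(pool.map(read_file, paths))
```

Calling read_small_files("/path/to/maildir") would return a dictionary mapping each file path to its raw bytes. Whether this beats Spark on a given machine depends on the disk and the number of files, so treat it as a baseline to measure against, not as a benchmark result.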
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)