Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/06/19 10:53:00 UTC

[jira] [Resolved] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

     [ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-21137.
-------------------------------
    Resolution: Invalid

This is a question for the mailing list, not JIRA.
It's not clear you're actually spending time reading files here -- there's no detail about your app or what you're looking at.

> Spark cannot read many small files (wholeTextFiles)
> ---------------------------------------------------
>
>                 Key: SPARK-21137
>                 URL: https://issues.apache.org/jira/browse/SPARK-21137
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: sam
>
> A very common use case in big data is reading a large number of small files.  For example, the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one hits many issues.  Firstly, even if the data is small (each file is only, say, 1K), any job can take a very long time.  I have a simple job that has been running for 3 hours and has not yet reached the point of starting any tasks; I doubt it will ever finish.
> It seems all the code in Spark that manages file listing is single-threaded and not well optimised.  When I hand-roll the listing code and don't use Spark, my job runs much faster.
> Is it possible that I'm missing some configuration option?  It seems rather surprising to me that Spark cannot read the Enron data, given that it's such a quintessential example.
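
For reference, a minimal Scala sketch of the two approaches being contrasted, assuming a hypothetical HDFS location for the Enron data and an illustrative partition count.  The commented-out wholeTextFiles call is the reported path; the hand-rolled listing is the kind of workaround the reporter describes.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.mutable.ArrayBuffer
    import scala.io.Source

    object SmallFilesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("enron-small-files"))

        // Baseline: one wholeTextFiles call over a hypothetical location.
        // The driver lists every path before any task starts, which matches
        // the "running for hours with no tasks" symptom described above.
        // val emails = sc.wholeTextFiles("hdfs:///data/enron/maildir/*/*/*")

        // Hand-rolled alternative: list the paths on the driver, then ship
        // them to the executors and read each file there.
        val root = new Path("hdfs:///data/enron/maildir")   // hypothetical path
        val fs = root.getFileSystem(sc.hadoopConfiguration)
        val paths = ArrayBuffer[String]()
        val it = fs.listFiles(root, /* recursive = */ true)
        while (it.hasNext) paths += it.next().getPath.toString

        val emails = sc.parallelize(paths, numSlices = 1000).map { p =>
          val path = new Path(p)
          // Hadoop Configuration is not serializable, so build a fresh one
          // on the executor; it picks up the cluster's default site files.
          val execFs = path.getFileSystem(new Configuration())
          val in = execFs.open(path)
          try (p, Source.fromInputStream(in, "UTF-8").mkString)
          finally in.close()
        }

        println(emails.count())
        sc.stop()
      }
    }

Note that wholeTextFiles also accepts a minPartitions argument, but that only affects how the already-listed files are grouped into partitions, not the time spent listing them.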



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org