Posted to user@spark.apache.org by Junjie Qian <qi...@outlook.com> on 2016/02/11 18:33:24 UTC

How to read files in a directory in parallel

Hi all,
I am working with Spark 1.6 and Scala, and have a big dataset divided into several small files.
My question is: right now the read operation takes a really long time and often produces RDD warnings. Is there a way I can read the files in parallel, so that all nodes or workers read the files at the same time?
Many thanks
Junjie

Re: How to read files in a directory in parallel

Posted by Arkadiusz Bicz <ar...@gmail.com>.
Hi Junjie,

In my experience HDFS is slow at reading a large number of small files, as
every file comes with a round of metadata traffic to the namenode and data
nodes. When the file size is below the HDFS default block size (usually
64 MB or 128 MB), you cannot fully use Hadoop's optimizations for reading
large amounts of data in a streamed way.

Also, when using DataFrames there is a huge overhead from caching file
information, as described in
https://issues.apache.org/jira/browse/SPARK-11441
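
For example, a rough sketch of compacting the per-file partitions at read
time (the path and the partition count below are assumptions, and sc is an
existing SparkContext):

    // Glob-read every small file; Spark creates roughly one partition
    // per file, so coalesce them into fewer, larger partitions.
    val raw = sc.textFile("hdfs:///data/small-files/*")
    val compacted = raw.coalesce(64)   // illustrative partition count
    compacted.cache()
    println(compacted.count())

This does not remove the namenode round-trips on the initial read, but it
keeps downstream stages from paying the per-file overhead on every pass.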

BR,
Arkadiusz Bicz
https://www.linkedin.com/in/arkadiuszbicz




Re: How to read files in a directory in parallel

Posted by Jakob Odersky <ja...@odersky.com>.
Hi Junjie,

How do you access the files currently? Have you considered using HDFS? It's
designed to be distributed across a cluster, and Spark has built-in support for it.
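
For example, a minimal sketch (the namenode address, the path, and the
minPartitions value are assumptions, and sc is an existing SparkContext):

    // Spark already parallelizes an HDFS directory read: each worker
    // reads the files/blocks of the partitions assigned to it.
    val files = sc.wholeTextFiles("hdfs://namenode:8020/data/input", minPartitions = 32)
    files.map { case (path, content) => (path, content.length) }
         .take(5)
         .foreach(println)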

Best,
--Jakob

Re: How to read files in a directory in parallel

Posted by Jörn Franke <jo...@gmail.com>.
Put the many small files into Hadoop Archives (HAR) to improve the performance of reading them. Alternatively, run a batch job that concatenates them into fewer, larger files.
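
For example, a rough sketch of such a batch compaction job in Scala (the
paths and the output file count are assumptions, and sc is an existing
SparkContext):

    // One-off job: merge the small files into a handful of large ones.
    val small = sc.textFile("hdfs:///data/small-files/*")
    small.coalesce(8)                  // ~8 output part files
         .saveAsTextFile("hdfs:///data/compacted")

    // Later jobs read the compacted copy instead:
    val big = sc.textFile("hdfs:///data/compacted")

If you go the HAR route instead, the archived files can be read back
through the har:// filesystem scheme.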
