Posted to user@spark.apache.org by Xiaomeng Wan <sh...@gmail.com> on 2016/11/08 17:31:05 UTC

read a large number of files on s3

Hi,
We have 30 million small files (about 100 KB each) on S3 to process. I am
thinking about something like the snippet below to load them in parallel:

val data = sc.wholeTextFiles("s3a://.../*.json")
  .map(...)
  .toDF()
data.createOrReplaceTempView("data")
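
Something fuller along the lines of what I had in mind (spark is the
SparkSession; the bucket path and the coalesce value are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-small-files").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// wholeTextFiles yields an RDD[(path, content)], one record per file.
// The bucket path below is a placeholder.
val data = sc.wholeTextFiles("s3a://my-bucket/data/*.json")
  .map { case (_, content) => content } // keep only the file body
  .toDF("json")
  .coalesce(1000)                       // placeholder: cap the partition count

data.createOrReplaceTempView("data")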

How should I estimate the driver memory this needs? Is there a better
practice, or should I merge the files in a preprocessing step? Thanks in advance.
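
By merging in preprocess I mean a one-time compaction job along these lines
(paths and the repartition value are placeholders): read the small files once,
write them back as a few large Parquet files, and point later jobs at the
consolidated copy.

// One-time compaction job: consolidate the small files into a small
// number of large Parquet files. Assumes the spark/sc and implicits
// from the sketch above; all paths here are placeholders.
val raw = sc.wholeTextFiles("s3a://my-bucket/data/*.json")
  .map { case (_, content) => content }
  .toDF("json")

raw.repartition(500) // placeholder: a few hundred large output files
  .write
  .parquet("s3a://my-bucket/consolidated/")

// Later jobs read the compact copy instead of 30 million small files:
val data = spark.read.parquet("s3a://my-bucket/consolidated/")
data.createOrReplaceTempView("data")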

Regards,
Shawn