Posted to user@spark.apache.org by Matthew Cheah <ma...@gmail.com> on 2014/02/17 03:14:32 UTC

Using local[N] gets "Too many open files"

Hi everyone,

I'm experimenting with Spark in both a distributed environment and as a
multi-threaded local application.

When I set the Spark master to local[8] and attempt to read a ~20GB text
file on the local file system into an RDD and perform computations on it, I
don't get an out-of-memory error, but rather a "Too many open files" error.
Is there a reason why this happens? How aggressively does Spark partition
the data into intermediate files?
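For context, here is roughly what my driver program looks like. The file
paths and the word-count computation below are simplified placeholders
standing in for my actual job, but the shape is the same:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object LocalFileJob {
      def main(args: Array[String]) {
        // Multi-threaded local mode with 8 worker threads
        val sc = new SparkContext("local[8]", "local-file-experiment")

        // Read the ~20GB text file from the local file system into an RDD
        val lines = sc.textFile("/data/input/big-file.txt")

        // Stand-in computation (word count) that involves a shuffle
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("/data/output/counts")
        sc.stop()
      }
    }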

I have also tried splitting the text file into numerous smaller text files
(around 100,000 of them) and processing 10,000 of them at a time
sequentially, roughly as sketched below. However, Spark then seems to
bottleneck on reading each individual file into the RDD before proceeding
with the computation, and it runs into problems even when reading 10,000
files at once. I would have thought that Spark could overlap I/O with
computation, but it seems that Spark does all of the I/O first?
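The batched version looks roughly like this, reusing sc and the imports
from the snippet above; the directory name and the batching logic are again
simplified placeholders:

    // Process the small files in sequential batches of 10,000
    val partFiles = new java.io.File("/data/input/parts")
      .listFiles()
      .map(_.getAbsolutePath)
      .sorted

    partFiles.grouped(10000).foreach { batch =>
      // textFile accepts a comma-separated list of paths
      val lines = sc.textFile(batch.mkString(","))
      val counts = lines
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1L))
        .reduceByKey(_ + _)
      counts.count()  // force this batch's computation before moving on
    }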

I was wondering whether Spark is simply not built for local applications
outside of testing.

Thanks,

-Matt Cheah