Posted to user@spark.apache.org by Konstantinos Kougios <ko...@googlemail.com> on 2015/07/01 18:06:25 UTC

binaryFiles() for 1 million files, too much memory required

Once again I am trying to read a directory tree using binaryFiles().

My directory tree has a root dir ROOTDIR and subdirs where the files are
located, e.g.

ROOTDIR/1
ROOTDIR/2
ROOTDIR/..
ROOTDIR/100

A total of 1 million files split across 100 subdirectories.

Calling binaryFiles() on ROOTDIR requires too much memory on the driver. I've
also tried creating an RDD with binaryFiles() for each subdir, combining them
with ++, and then calling rdd.saveAsObjectFile("outputDir"). That approach
instead requires a lot of memory on the executors!

What is the proper way to use binaryFiles with this number of files?

Thanks




