Posted to user@spark.apache.org by Konstantinos Kougios <ko...@googlemail.com> on 2015/07/01 18:06:25 UTC
binaryFiles() for 1 million files, too much memory required
Once again I am trying to read a directory tree using binaryFiles().
My directory tree has a root dir ROOTDIR and subdirs where the files are
located, i.e.
ROOTDIR/1
ROOTDIR/2
ROOTDIR/..
ROOTDIR/100
A total of 1 million files split into 100 subdirs.
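For reference, the direct approach looks like this (a minimal sketch; ROOTDIR stands in for the real root path, and the glob is assumed to match the 100 subdirs):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("read-binary-tree"))

// One call over the whole tree; Spark lists all ~1M files on the driver
// to compute the input splits, which seems to be where the driver
// memory pressure comes from.
val files = sc.binaryFiles("ROOTDIR/*") // RDD[(String, PortableDataStream)]
```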
Using binaryFiles() on the whole tree requires too much memory on the
driver. I've also tried building one RDD per subdir with
binaryFiles(subdir), combining them with ++, and then calling
rdd.saveAsObjectFile("outputDir"). That instead requires a lot of memory
on the executors!
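The per-subdir variant I tried can be sketched as follows (assuming the subdirs are literally named 1 through 100, as above):

```scala
// Build one RDD per subdir, then union them; sc.union avoids chaining
// 100 pairwise ++ calls but is otherwise equivalent.
val perDir = (1 to 100).map(i => sc.binaryFiles(s"ROOTDIR/$i"))
val all = sc.union(perDir)

// Writing the (path, PortableDataStream) pairs out is what appears to
// drive up executor memory.
all.saveAsObjectFile("outputDir")
```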
What is the proper way to use binaryFiles with this number of files?
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org