Posted to user@spark.apache.org by Christopher Piggott <cp...@gmail.com> on 2018/01/08 15:03:28 UTC
binaryFiles() on directory full of directories
I have a top-level directory in HDFS that contains nothing but
subdirectories (no actual files). Each of those subdirectories contains a
mix of files and further subdirectories:
/topdir/dir1/(lots of files)
/topdir/dir2/(lots of files)
/topdir/dir2/subdir/(lots of files)
I noticed something strange:
spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*", 32*8)
.filter { case (fileName, contents) => fileName.endsWith(".xyz") }
.map { case (fileName, contents) => 1}
.reduce(_+_)
fails with an ArrayIndexOutOfBoundsException ... but if I specify it as:
spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*/*", 32*8)
it works.
I played around a little and found I could get the first attempt to work
if I just put one regular file in /topdir.
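For what it's worth, the filter/map/reduce itself is just counting the
matching files, so the same logic can be sanity-checked on a plain
collection without a cluster. The paths and byte contents below are made
up for illustration; they stand in for the (fileName, contents) pairs
that binaryFiles() would return:

```scala
// Hypothetical stand-in for binaryFiles() output: (fileName, contents) pairs.
// The paths and contents here are invented for illustration only.
val files = Seq(
  ("hdfs://10.240.2.200/topdir/dir1/a.xyz", Array[Byte](1, 2)),
  ("hdfs://10.240.2.200/topdir/dir1/b.txt", Array[Byte](3)),
  ("hdfs://10.240.2.200/topdir/dir2/subdir/c.xyz", Array[Byte](4))
)

// Same filter/map/reduce as the Spark job: count files ending in ".xyz".
val count = files
  .filter { case (fileName, _) => fileName.endsWith(".xyz") }
  .map { case (_, _) => 1 }
  .reduce(_ + _)

println(count) // 2
```

If the glob is the culprit, another avenue might be setting
mapreduce.input.fileinputformat.input.dir.recursive=true on the Hadoop
configuration and passing just the top directory, though I haven't
verified that binaryFiles() honors that flag.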
This is with Spark 2.2.1.
Is this known behavior?
--C