Posted to user@spark.apache.org by Christopher Piggott <cp...@gmail.com> on 2018/01/08 15:03:28 UTC

binaryFiles() on directory full of directories

I have a top-level directory in HDFS that contains nothing but
subdirectories (no actual files). In each of those subdirs is a
combination of files and other subdirs:

        /topdir/dir1/(lots of files)
        /topdir/dir2/(lots of files)
        /topdir/dir2/subdir/(lots of files)

I noticed something strange:

spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*", 32 * 8)
    .filter { case (fileName, contents) => fileName.endsWith(".xyz") }
    .map { case (fileName, contents) => 1 }
    .reduce(_ + _)

fails with an ArrayIndexOutOfBoundsException ... but if I specify it as:

  spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*/*", 32 * 8)

it works.
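
As far as I can tell, the two globs match different things: /topdir/*
expands to the subdirectory entries themselves, while /topdir/*/*
expands to the files inside them. One way to check what each pattern
actually matches (a rough sketch, same paths as above) is Hadoop's
FileSystem.globStatus:

    import org.apache.hadoop.fs.Path

    val topdir = new Path("hdfs://10.240.2.200/topdir")
    val fs = topdir.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // print every match for each pattern and whether it is a directory
    for (pattern <- Seq("hdfs://10.240.2.200/topdir/*",
                        "hdfs://10.240.2.200/topdir/*/*");
         status <- fs.globStatus(new Path(pattern))) {
      println(s"$pattern -> ${status.getPath}  dir=${status.isDirectory}")
    }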

I played around a little and found that I could get the first attempt to
work if I just put one regular file in /topdir.
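
For what it's worth, a workaround that avoids the glob entirely is to
enumerate the subdirectories with the Hadoop FileSystem API and hand
binaryFiles() an explicit comma-separated path list. A rough sketch
under the layout above, not carefully tested:

    import org.apache.hadoop.fs.Path

    val topdir = new Path("hdfs://10.240.2.200/topdir")
    val fs = topdir.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // take only the directories directly under /topdir, skipping any
    // plain files that happen to sit there
    val subdirs = fs.listStatus(topdir)
      .filter(_.isDirectory)
      .map(_.getPath.toString)
      .mkString(",")

    spark.sparkContext.binaryFiles(subdirs, 32 * 8)
      .filter { case (fileName, _) => fileName.endsWith(".xyz") }
      .map(_ => 1)
      .reduce(_ + _)

I believe binaryFiles() accepts a comma-separated list of inputs, so this
sidesteps whatever the glob expansion is doing with directory-only matches.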

This is with Spark 2.2.1.

Is this known behavior?

--C