Posted to user@spark.apache.org by Sguj <tp...@yahoo.com> on 2014/06/12 18:05:48 UTC

wholeTextFiles not working with HDFS

I'm trying to get a list of every filename in a directory from HDFS using
pySpark, and the only thing that seems like it would return the filenames is
the wholeTextFiles function. My code for just trying to collect that data is
this:

       files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
       files = files.collect()

These lines return the error "java.io.FileNotFoundException: File
/user/me/target/capacity-scheduler.xml does not exist", which makes it seem
like the HDFS information isn't being used by the wholeTextFiles
function.

Those lines work if I use them on a local filesystem directory, and the
textFile() function works on the HDFS directory I'm trying to use
wholeTextFiles() on.

I need either a way to fix this or an alternate method of reading the
filenames from a directory in HDFS.
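
One possible workaround (a sketch, not from this thread) is to bypass
wholeTextFiles and list the directory through the Hadoop FileSystem API that
pySpark exposes via its JVM gateway. Note that _jvm and _jsc are pySpark
internals rather than public API, and the hdfs:// URI below is the same
placeholder as in the question:

        # Sketch: list file names in an HDFS directory via the Hadoop
        # FileSystem API. Assumes a running SparkContext `sc`; replace the
        # placeholder host/port with the actual NameNode address.
        Path = sc._jvm.org.apache.hadoop.fs.Path
        target = Path("hdfs://localhost:port/users/me/target")
        fs = target.getFileSystem(sc._jsc.hadoopConfiguration())
        filenames = [status.getPath().getName() for status in fs.listStatus(target)]
        print(filenames)

The resulting names could then be read one at a time with textFile(), which
is reported above to work against HDFS.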




Re: wholeTextFiles not working with HDFS

Posted by Sguj <tp...@yahoo.com>.
I can write one if you'll point me to where I need to write it.




Re: wholeTextFiles not working with HDFS

Posted by Xusen Yin <yi...@gmail.com>.
Hi Sguj and littlebird,

I'll try to fix it tomorrow evening and the day after tomorrow, because I
am currently busy preparing slides for a talk tomorrow. Sorry for the
inconvenience. Would you mind filing an issue on Spark JIRA?


2014-06-17 20:55 GMT+08:00 Sguj <tp...@yahoo.com>:

> I didn't fix the issue so much as work around it. I was running my cluster
> locally, so using HDFS was just a preference. The code worked with the local
> file system, so that's what I'm using until I can get some help.



-- 
Best Regards
-----------------------------------
Xusen Yin    (尹绪森)
Intel Labs China
Homepage: http://yinxusen.github.io/

Re: wholeTextFiles not working with HDFS

Posted by Sguj <tp...@yahoo.com>.
I didn't fix the issue so much as work around it. I was running my cluster
locally, so using HDFS was just a preference. The code worked with the local
file system, so that's what I'm using until I can get some help.




Re: wholeTextFiles not working with HDFS

Posted by littlebird <cx...@163.com>.
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!




Re: wholeTextFiles not working with HDFS

Posted by Sguj <tp...@yahoo.com>.
My exception stack looks about the same.

java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
        at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:717)

I'm using Hadoop 1.2.1, and everything else I've tried in Spark with that
version has worked, so I doubt it's a version error.




Re: wholeTextFiles not working with HDFS

Posted by yinxusen <yi...@gmail.com>.
Hi Sguj,

Could you give me the exception stack?

I tested it on my laptop and found that it gets the wrong FileSystem: it
should be DistributedFileSystem, but it finds RawLocalFileSystem.

If we get the same exception stack, I'll try to fix it.

Here is my exception stack:

java.io.FileNotFoundException: File /sen/reuters-out/reut2-000.sgm-0.txt does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
        at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:173)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:728)

Also, what's your Hadoop version?
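
One quick way to check which FileSystem implementation a path resolves to,
sketched here in pySpark through the JVM gateway (the _jvm and _jsc accessors
are pySpark internals, and the URI is a placeholder):

        # Sketch: print the FileSystem class a path resolves to. For an
        # hdfs:// URI this should be org.apache.hadoop.hdfs.DistributedFileSystem;
        # the stacks above show the input format consulting RawLocalFileSystem instead.
        path = sc._jvm.org.apache.hadoop.fs.Path("hdfs://localhost:port/users/me/target")
        fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
        print(fs.getClass().getName())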





Re: wholeTextFiles not working with HDFS

Posted by pierred <pi...@demartines.com>.
I forgot to say: I am using bin/spark-shell with spark-1.0.2.
That host has Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_11).






Re: wholeTextFiles not working with HDFS

Posted by pierred <pi...@demartines.com>.
I had the same issue with spark-1.0.2-bin-hadoop1, and indeed the issue
seems related to Hadoop 1. When I switch to spark-1.0.2-bin-hadoop2, the
issue disappears.
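
To confirm which Hadoop version a given Spark build is linked against, one
option (a pySpark sketch using the internal _jvm gateway, not public API) is
to query Hadoop's VersionInfo:

        # Sketch: print the Hadoop version bundled with this Spark build.
        print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())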






Re: wholeTextFiles not working with HDFS

Posted by kmader <ke...@gmail.com>.
That worked for me as well. I was using Spark 1.0 compiled against Hadoop
1.0; switching to Spark 1.0.1 compiled against Hadoop 2 fixed it.




Re: wholeTextFiles not working with HDFS

Posted by kmader <ke...@gmail.com>.
I have the same issue

        val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")
        a.first

works perfectly fine, but

        val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")
        d.first

does not work, and gives the following error message:

        java.io.FileNotFoundException: File /MyBucket/MyFolder.tif does not exist.


