Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/10/19 20:48:28 UTC

[jira] [Resolved] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

     [ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-4414.
-------------------------------
    Resolution: Won't Fix

I'm going to resolve this as "Won't Fix", since I think that the difficulty and risk of fixing this in Spark are too high right now. While in principle we could fix this by inlining the affected Hadoop classes in Spark, it would be extremely difficult to do so in a way that is source- and binary-compatible with all of the Hadoop versions that we need to support.

Affected users should upgrade to Hadoop 1.2.1 or higher, which does not appear to be affected by this bug.
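
As a quick sanity check, something like the following sketch (mine, not from this ticket; it only uses Hadoop's VersionInfo utility) prints the Hadoop version that a Spark application is linked against:

import org.apache.hadoop.util.VersionInfo

object HadoopVersionCheck {
  def main(args: Array[String]): Unit = {
    // Prints e.g. "1.0.4" on affected builds; 1.2.1 and higher are unaffected.
    println("Hadoop version: " + VersionInfo.getVersion)
  }
}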

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> --------------------------------------------------------
>
>                 Key: SPARK-4414
>                 URL: https://issues.apache.org/jira/browse/SPARK-4414
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Pedro Rodriguez
>            Assignee: Josh Rosen
>            Priority: Critical
>
> SparkContext.wholeTextFiles fails to read files that SparkContext.textFile can read. Below are general steps to reproduce; my specific case, which follows these steps, is in a git repo (linked below).
> Steps to reproduce:
> 1. Create an Amazon S3 bucket, make it public, and add multiple files.
> 2. Attempt to read the bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark throws the following exception, even though the file exists:
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
> 	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error; the application runs as expected. (A minimal reproduction sketch follows these steps.)
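> For reference, here is a minimal self-contained sketch of the failing case (my addition: the bucket and file names are the placeholders from the steps above, and the master URL and S3 credentials are assumed to come from spark-submit and core-site.xml):
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> object WholeTextFilesRepro {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("SPARK-4414-repro"))
>     // On affected Hadoop versions this throws java.io.FileNotFoundException:
>     // per the stack trace above, CombineFileInputFormat resolves the path
>     // against the default (HDFS) filesystem instead of the s3n filesystem.
>     val files = sc.wholeTextFiles("s3n://mybucket/myfile.txt")
>     files.collect().foreach { case (path, contents) =>
>       println(path + " -> " + contents.length + " chars")
>     }
>     sc.stop()
>   }
> }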
> There is also a question about this on StackOverflow:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> Here is a link to the relevant lines of code in my repo; the uncommented call does not work, while the commented-out call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to work around this by passing multiple files to textFile (see the sketch below), but wholeTextFiles should work correctly for S3 bucket files as well.
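> A possible interim workaround (my sketch, untested; the file names are placeholders): textFile accepts a comma-separated list of paths as well as glob patterns, although it yields an RDD of lines rather than (filename, contents) pairs:
>
> // Both forms read correctly from s3n, unlike wholeTextFiles.
> val lines = sc.textFile("s3n://mybucket/myfile1.txt,s3n://mybucket/myfile2.txt")
> val globbed = sc.textFile("s3n://mybucket/*.txt")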



