Posted to issues@spark.apache.org by "Kriton Tsintaris (JIRA)" <ji...@apache.org> on 2014/11/26 14:07:13 UTC

[jira] [Created] (SPARK-4624) Errors when reading/writing large object files on S3

Kriton Tsintaris created SPARK-4624:
---------------------------------------

             Summary: Errors when reading/writing large object files on S3
                 Key: SPARK-4624
                 URL: https://issues.apache.org/jira/browse/SPARK-4624
             Project: Spark
          Issue Type: Bug
          Components: EC2, Input/Output, Mesos
    Affects Versions: 1.1.0
         Environment: Manually set up Mesos cluster in EC2 consisting of 30 c3.4xlarge nodes
            Reporter: Kriton Tsintaris
            Priority: Critical


My cluster is not configured to use HDFS; instead, the local disk of each node is used.

I've got a number of huge RDD object files (each made of ~600 part files each of ~60 GB). They are updated extremely rarely.

As an example, the data stored in these RDDs has the following type: (Long, Array[Long]).

When I load them onto my cluster, using val page_users = sc.objectFile[(Long,Array[Long])]("s3n://mybucket/path/myrdd.obj.rdd") or equivalent, data is sometimes missing (as if 1 or 2 of the part files were not successfully loaded).
What is more frustrating is that I get no error indicating that this has happened! Sometimes reads from S3 time out or hit other errors, but the automatic retries eventually succeed.
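
One way the silent loss could at least be made visible is to compare the number of records actually loaded against a count recorded when the RDD was written, and to reload on a mismatch. The following is only a minimal sketch in plain Scala, not anything Spark provides: LoadCheck, countWithRetry, expected and loadAndCount are hypothetical names, and it assumes the expected count has been stored somewhere alongside the data.

```scala
object LoadCheck {
  // Hypothetical helper, not part of Spark.
  // Calls `loadAndCount` up to `maxAttempts` times and returns the first count
  // that equals `expected`. If no attempt matches, returns a message listing
  // every count that was observed, so the shortfall is no longer silent.
  def countWithRetry(expected: Long, maxAttempts: Int)(loadAndCount: () => Long): Either[String, Long] = {
    @annotation.tailrec
    def loop(attempt: Int, seen: Vector[Long]): Either[String, Long] =
      if (attempt >= maxAttempts)
        Left(s"expected $expected records, observed counts: ${seen.mkString(", ")}")
      else {
        val n = loadAndCount()
        if (n == expected) Right(n) else loop(attempt + 1, seen :+ n)
      }
    loop(0, Vector.empty)
  }
}
```

In a Spark job, loadAndCount would presumably wrap the sc.objectFile call above followed by count() on the resulting RDD. This only detects the problem after the fact; it does not explain why part files are skipped.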

Furthermore, if I attempt to write an RDD back to S3, using myrdd.saveAsObjectFile("s3n://..."), the operation again terminates before completing, without any warning or indication of error.
More specifically, the part files of the object file are left under a _temporary folder, and only a few of them have been moved to the correct path in S3. This only happens when I am writing huge object files; if my object file is just a few GB, everything is fine.
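
The incomplete state described above can be checked for mechanically once the keys under the output path have been listed. The sketch below is plain Scala and makes several assumptions: OutputCheck, inspect and Report are hypothetical names, the keys are given relative to the output path, committed part files are named part-00000, part-00001 and so on, and the expected number of parts is known from the RDD's partition count.

```scala
object OutputCheck {
  // Matches a committed part file directly under the output path, e.g. "part-00042".
  private val PartFile = """part-(\d+)""".r

  // Hypothetical report type: which part indices are absent from the final
  // location, and which keys are still sitting under _temporary.
  final case class Report(missingParts: Seq[Int], leftoverTemporary: Seq[String]) {
    def complete: Boolean = missingParts.isEmpty && leftoverTemporary.isEmpty
  }

  // `keys` are object keys relative to the output path;
  // `expectedParts` is the number of partitions that were written.
  def inspect(keys: Seq[String], expectedParts: Int): Report = {
    val leftover = keys.filter(_.startsWith("_temporary/"))
    // The regex must match the whole key, so files under _temporary are not counted.
    val present = keys.collect { case PartFile(n) => n.toInt }.toSet
    val missing = (0 until expectedParts).filterNot(present.contains)
    Report(missing, leftover)
  }
}
```

Running such a check after saveAsObjectFile returns would turn the silent partial write into an explicit failure, although it does nothing to address the underlying cause.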





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
