Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/01/06 11:00:49 UTC
[jira] [Resolved] (SPARK-4624) Errors when reading/writing to S3 large object files
[ https://issues.apache.org/jira/browse/SPARK-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-4624.
------------------------------
Resolution: Cannot Reproduce
This sounds like an S3 issue, but reopen if you can still reproduce and have more specific info about how to do it, and what exactly happens.
> Errors when reading/writing to S3 large object files
> ----------------------------------------------------
>
> Key: SPARK-4624
> URL: https://issues.apache.org/jira/browse/SPARK-4624
> Project: Spark
> Issue Type: Bug
> Components: EC2, Input/Output, Mesos
> Affects Versions: 1.1.0
> Environment: manually set up Mesos cluster in EC2 made of 30 c3.4xlarge nodes
> Reporter: Kriton Tsintaris
> Priority: Critical
>
> My cluster is not configured to use HDFS. Instead the local disk of each node is used.
> I've got a number of huge RDD object files (each made of ~600 part files of ~60 GB apiece). They are updated extremely rarely.
> An example of the model of the data stored in these RDDs is the following: (Long, Array[Long]).
> When I load them into my cluster, using val page_users = sc.objectFile[(Long,Array[Long])]("s3n://mybucket/path/myrdd.obj.rdd") or equivalent, sometimes data is missing (as if 1 or 2 of the part files were not successfully loaded).
> What is more frustrating is that I get no errors when this has happened! Sometimes reading from S3 times out or hits errors, but the automatic retries eventually succeed.
> Furthermore, if I attempt to write an RDD back into S3, using myrdd.saveAsObjectFile("s3n://..."), the operation will again terminate before completing, without any warning or indication of error.
> More specifically, what happens is that the object file parts are left under a _temporary folder and only a few of them are moved to the correct path in S3. This only happens when I am writing huge object files. If my object file is just a few GB everything is fine.
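Since the report describes silent truncation with no exception raised, one way to surface the loss is a count-based round-trip check against the same RDD API the reporter uses. The sketch below is illustrative only: the bucket and paths are placeholders (not the reporter's actual data), and it assumes a Spark 1.x-style SparkContext as in the report.

```scala
// Hedged sketch: detect silent data loss when round-tripping an RDD
// through S3 via objectFile / saveAsObjectFile. All bucket/path names
// below are placeholders, not taken from the original report.
import org.apache.spark.{SparkConf, SparkContext}

object S3RoundTripCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-roundtrip-check"))

    val src = "s3n://mybucket/path/myrdd.obj.rdd"   // placeholder source path
    val dst = "s3n://mybucket/path/myrdd.copy.rdd"  // placeholder destination path

    // Count immediately on load, so a silently dropped part file
    // shows up later as a record-count mismatch.
    val rdd = sc.objectFile[(Long, Array[Long])](src)
    val expected = rdd.count()

    rdd.saveAsObjectFile(dst)

    // Re-read the copy and compare counts; a shortfall would indicate
    // that some part files never made it out of the _temporary folder.
    val actual = sc.objectFile[(Long, Array[Long])](dst).count()
    require(actual == expected,
      s"S3 round trip lost records: wrote $expected, read back $actual")

    sc.stop()
  }
}
```

This only detects the truncation after the fact; it does not prevent it, and on a large RDD the extra count() passes are themselves expensive.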
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org