Posted to user@spark.apache.org by Reminia Scarlet <re...@gmail.com> on 2016/05/27 09:53:45 UTC

problem about RDD map and then saveAsTextFile

Hi all:
 I’ve tried to execute something like the following:

 result.map(transform).saveAsTextFile(hdfsAddress)

 result is an RDD calculated by an MLlib algorithm.


I submitted this to YARN, and after two attempts the application failed.

But the exception in the log is very misleading: it said hdfsAddress already
exists.

Actually, the log from the first attempt shows that the exception came from
the calculation of result. Although that attempt failed, it still created the
output directory, and attempt 2 then failed immediately with the exception
‘file already exists’.


 Why was the output created even though the RDD calculation had already
failed? That doesn’t seem right to me.

Re: problem about RDD map and then saveAsTextFile

Posted by Christian Hellström <ps...@gmail.com>.
Internally, saveAsTextFile uses saveAsHadoopFile:
https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

The final bit in the method first creates the output path and then saves
the data set. However, if there is an issue with the saveAsHadoopDataset
call, the path still remains. Technically, we could add an
exception-handling section that removes the path in case of problems. I
think that would be a nice way of making sure that we don’t litter the FS
with empty files and directories in case of exceptions.
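
For illustration, here is a minimal sketch of that idea in user code (not
Spark's actual internals): it wraps the save in a try/catch and removes the
partially written path on failure. The application name and output path are
placeholders.

 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.spark.{SparkConf, SparkContext}

 object SaveWithCleanup {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("save-with-cleanup"))
     val hdfsAddress = "hdfs:///tmp/example-output" // placeholder path
     val result = sc.parallelize(1 to 100)          // stand-in for the MLlib result

     try {
       result.map(_.toString).saveAsTextFile(hdfsAddress)
     } catch {
       case e: Exception =>
         // Delete the partially created output directory so a retry
         // does not fail with "output directory already exists".
         val fs = FileSystem.get(sc.hadoopConfiguration)
         fs.delete(new Path(hdfsAddress), true) // recursive delete
         throw e
     } finally {
       sc.stop()
     }
   }
 }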

So, to your question: the parameter to saveAsTextFile is a path (not a file),
and it must not already exist. Spark automatically names the output files
part-NNNNN, where NNNNN is the partition number; this follows directly from
the partitioning scheme of the RDD itself.
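
As a hedged example (reusing result, transform, sc, and hdfsAddress from the
original post), this is one way to clear a leftover directory from a failed
attempt before saving, and then list the part files Spark wrote:

 import org.apache.hadoop.fs.{FileSystem, Path}

 val out = new Path(hdfsAddress)
 val fs = FileSystem.get(sc.hadoopConfiguration)

 // Remove leftovers from a failed attempt; otherwise saveAsTextFile
 // fails because the output directory already exists.
 if (fs.exists(out)) fs.delete(out, true)

 result.map(transform).saveAsTextFile(hdfsAddress)

 // One part file per partition, e.g. part-00000, part-00001, ...,
 // plus a _SUCCESS marker written by the Hadoop output committer.
 fs.listStatus(out).foreach(s => println(s.getPath.getName))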

The real issue, though, is with the calculation itself. You might want to fix
that first; just post the relevant bits from the log.