Posted to user@spark.apache.org by Arun Luthra <ar...@gmail.com> on 2014/09/11 00:46:57 UTC

PrintWriter error in foreach

I have a Spark program that works in local mode but throws an error in
yarn-client mode on a cluster. On the edge node, in my home directory, I
have an output directory (called transoutput) that is ready to receive files.
The Spark job is supposed to write a few hundred files into that directory,
one for each iteration of a foreach function. This works in local mode, and
my only guess as to why it fails in yarn-client mode is that the RDD is
distributed across many nodes and the program is trying to use the
PrintWriter on the data nodes, where the output directory doesn't exist. Is
that what's happening? Any proposed solution?

abbreviation of the code:

import java.io.PrintWriter
...
rdd.foreach { id =>
  val outFile = new PrintWriter("transoutput/output.%s".format(id))
  outFile.println("test")
  outFile.close()
}

Error:

14/09/10 16:57:09 WARN TaskSetManager: Lost TID 1826 (task 0.0:26)
14/09/10 16:57:09 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: transoutput/input.598718 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
at java.io.PrintWriter.<init>(PrintWriter.java:146)
at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:98)
at com.att.bdcoe.cip.ooh.TransformationLayer$$anonfun$main$3.apply(TransformLayer.scala:95)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:703)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Re: PrintWriter error in foreach

Posted by Arun Luthra <ar...@gmail.com>.
OK, so I don't think the workers on the data nodes will be able to see my
output directory on the edge node. I don't think stdout will work either,
so I'll write to HDFS via rdd.saveAsTextFile(...).
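A minimal sketch of that approach, assuming an RDD of (id, payload) pairs; the helper name and the HDFS path below are hypothetical, and the Spark call itself is left as a comment since it needs a cluster:

```scala
// Hypothetical helper: turn the (id, payload) pair that would have gone
// into its own local file into a single line of text instead.
def formatRecord(id: Long, payload: String): String =
  "output.%s\t%s".format(id, payload)

// Driver-side sketch (assumes rdd: RDD[(Long, String)]):
//   rdd.map { case (id, payload) => formatRecord(id, payload) }
//      .saveAsTextFile("hdfs:///user/arun/transoutput")
```

saveAsTextFile writes one part-NNNNN file per partition into the target directory, so a few hundred logical outputs become a set of HDFS files that every executor, and the edge node, can reach.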

On Wed, Sep 10, 2014 at 3:51 PM, Daniil Osipov <da...@shazam.com>
wrote:

> Try providing the full path to the file you want to write, and make sure
> the directory exists and is writable by the Spark process.

Re: PrintWriter error in foreach

Posted by Daniil Osipov <da...@shazam.com>.
Try providing the full path to the file you want to write, and make sure the
directory exists and is writable by the Spark process.
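A minimal sketch of that check using only java.io (the helper name is hypothetical):

```scala
import java.io.{File, PrintWriter}

// Write `text` to a file under an absolute base directory, creating the
// directory first so the call also works on machines where it doesn't
// exist yet, and failing fast if it isn't writable.
def writeRecord(baseDir: String, id: Long, text: String): File = {
  val dir = new File(baseDir)
  if (!dir.exists()) dir.mkdirs()                      // ensure directory exists
  require(dir.canWrite, baseDir + " is not writable")  // fail with a clear message
  val out = new File(dir, "output.%s".format(id))
  val writer = new PrintWriter(out)
  try writer.println(text)
  finally writer.close()
  out
}
```

Note that inside rdd.foreach such a path is resolved on each executor's local filesystem, so in yarn-client mode the files land on the worker machines rather than the edge node; writing to HDFS avoids that.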

On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra <ar...@gmail.com> wrote:
