Posted to user@spark.apache.org by Mike Percy <mp...@apache.org> on 2014/01/10 02:29:53 UTC

Spark streaming on YARN?

After looking through the docs, grepping the commit logs and looking on the
list archives, I have been unable to see an indication or example of Spark
streaming working on YARN. Is this possible yet? So far, I've gotten at
least the Spark Pi example to run on YARN with CDH5 beta 1.

I am about to dig into the code and try to figure out how the batch Yarn
client works, to see how much work it would be to set up an AM to run an
InputDStream, but thought I'd make it easy on myself and ask here first before
I got started.

Thanks in advance for any pointers,
Mike

Re: Spark streaming on YARN?

Posted by Mike Percy <mp...@apache.org>.
Great! I don't know if I'll have time to try it soon, but it's good to know
that it's worth giving it another shot. It would be great if someone could
report back if they have been able to get it up and running.

Mike


On Tue, Jan 21, 2014 at 4:09 PM, Tathagata Das
<ta...@gmail.com> wrote:

> Actually, I just realised why it wasn't working. I made some changes to the
> examples last night and they are in the 0.9-rc4 that is being voted on now.
> With those changes the streaming examples should work.
>
> The issue was that the thread running the main() of the examples was not
> waiting for the streaming context to continue processing, so the processing
> was being cancelled as soon as the streaming context was started. This has
> been fixed in the examples, and running the Twitter examples should be the
> same as running the SparkPi example (though you have to set the
> twitter4j.oauth.* properties for the Twitter example to work).
>
> TD
>
>
> On Tue, Jan 21, 2014 at 12:13 PM, Mike Percy <mp...@apache.org> wrote:
>
>> Just FYI, I got it working in standalone mode with no problems, so thanks
>> for the help, Tathagata. I was not able to get it working on YARN, so I
>> gave up.
>>
>> Thanks,
>> Mike
>>
>>
>> On Fri, Jan 10, 2014 at 6:17 PM, Tathagata Das <
>> tathagata.das1565@gmail.com> wrote:
>>
>>> Let me know if you have any trouble getting it running in standalone
>>> mode. That will be easier for us to help out with.
>>>
>>> TD
>>>
>>>
>>> On Fri, Jan 10, 2014 at 5:47 PM, Mike Percy <mp...@apache.org> wrote:
>>>
>>>> On Fri, Jan 10, 2014 at 6:35 AM, Neal Sidhwaney <ne...@gmail.com> wrote:
>>>>
>>>>> Are you setting the SPARK_JAVA_OPTS environment variable? Under 0.8.1,
>>>>> we had to set that to -Dlog4j.configuration=<path to logging
>>>>> properties file>  to get STDOUT for workers.  This was in standalone
>>>>> mode, though not with YARN, so YMMV.
>>>>>  Neal
>>>>>
>>>>
>>>> Thanks for the idea, Neal. The docs say that just putting it into conf/
>>>> should do the trick, but apparently that's not the case. I tried setting
>>>> that and it just stopped printing to the shell, and it had no effect on
>>>> the deployed container.
>>>>
>>>> After struggling with this too much, I'm going to work on getting this
>>>> running in standalone mode.
>>>>
>>>> Mike
>>>>
>>>>
>>>
>>
>

Re: Spark streaming on YARN?

Posted by Tathagata Das <ta...@gmail.com>.
Actually, I just realised why it wasn't working. I made some changes to the
examples last night and they are in the 0.9-rc4 that is being voted on now.
With those changes the streaming examples should work.

The issue was that the thread running the main() of the examples was not
waiting for the streaming context to continue processing, so the processing
was being cancelled as soon as the streaming context was started. This has
been fixed in the examples, and running the Twitter examples should be the
same as running the SparkPi example (though you have to set the
twitter4j.oauth.* properties for the Twitter example to work).
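
In other words, after setting everything up, the example's main() now starts
the context and then blocks instead of returning right away. A minimal
sketch of that pattern against the 0.9 streaming API (the object name is
illustrative and this is not the exact example source):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingExampleSketch {
  def main(args: Array[String]) {
    // args(0) = Spark master (e.g. "yarn-standalone"), args(1) = HDFS directory to watch
    val ssc = new StreamingContext(args(0), "HdfsWordCount", Seconds(2))
    val wordCounts = ssc.textFileStream(args(1))
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()   // blocks main() so the job keeps running (new in 0.9)
  }
}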

TD


On Tue, Jan 21, 2014 at 12:13 PM, Mike Percy <mp...@apache.org> wrote:

> Just FYI, I got it working in standalone mode with no problems, so thanks
> for the help, Tathagata. I was not able to get it working on YARN, so I
> gave up.
>
> Thanks,
> Mike
>
>
> On Fri, Jan 10, 2014 at 6:17 PM, Tathagata Das <
> tathagata.das1565@gmail.com> wrote:
>
>> Let me know if you have any trouble getting it running in standalone
>> mode. That will be easier for us to help out with.
>>
>> TD
>>
>>
>> On Fri, Jan 10, 2014 at 5:47 PM, Mike Percy <mp...@apache.org> wrote:
>>
>>> On Fri, Jan 10, 2014 at 6:35 AM, Neal Sidhwaney <ne...@gmail.com> wrote:
>>>
>>>> Are you setting the SPARK_JAVA_OPTS environment variable? Under 0.8.1,
>>>> we had to set that to -Dlog4j.configuration=<path to logging
>>>> properties file>  to get STDOUT for workers.  This was in standalone
>>>> mode, though not with YARN, so YMMV.
>>>>  Neal
>>>>
>>>
>>> Thanks for the idea, Neal. The docs say that just putting it into conf/
>>> should do the trick, but apparently that's not the case. I tried setting
>>> that and it just stopped printing to the shell, and it had no effect on
>>> the deployed container.
>>>
>>> After struggling with this too much, I'm going to work on getting this
>>> running in standalone mode.
>>>
>>> Mike
>>>
>>>
>>
>

Re: Spark streaming on YARN?

Posted by Mike Percy <mp...@apache.org>.
Just FYI, I got it working in standalone mode with no problems, so thanks
for the help, Tathagata. I was not able to get it working on YARN, so I
gave up.

Thanks,
Mike


On Fri, Jan 10, 2014 at 6:17 PM, Tathagata Das
<ta...@gmail.com> wrote:

> Let me know if you have any trouble getting it running in standalone
> mode. That will be easier for us to help out with.
>
> TD
>
>
> On Fri, Jan 10, 2014 at 5:47 PM, Mike Percy <mp...@apache.org> wrote:
>
>> On Fri, Jan 10, 2014 at 6:35 AM, Neal Sidhwaney <ne...@gmail.com> wrote:
>>
>>> Are you setting the SPARK_JAVA_OPTS environment variable? Under 0.8.1,
>>> we had to set that to -Dlog4j.configuration=<path to logging properties
>>> file>  to get STDOUT for workers.  This was in standalone mode, though
>>> not with YARN, so YMMV.
>>>  Neal
>>>
>>
>> Thanks for the idea, Neal. The docs say that just putting it into conf/
>> should do the trick, but apparently that's not the case. I tried setting
>> that and it just stopped printing to the shell, and it had no effect on
>> the deployed container.
>>
>> After struggling with this too much, I'm going to work on getting this
>> running in standalone mode.
>>
>> Mike
>>
>>
>

Re: Spark streaming on YARN?

Posted by Tathagata Das <ta...@gmail.com>.
Let me know if you have any trouble getting it running in standalone
mode. That will be easier for us to help out with.

TD


On Fri, Jan 10, 2014 at 5:47 PM, Mike Percy <mp...@apache.org> wrote:

> On Fri, Jan 10, 2014 at 6:35 AM, Neal Sidhwaney <ne...@gmail.com> wrote:
>
>> Are you setting the SPARK_JAVA_OPTS environment variable? Under 0.8.1, we
>> had to set that to -Dlog4j.configuration=<path to logging properties
>> file>  to get STDOUT for workers.  This was in standalone mode, though
>> not with YARN, so YMMV.
>>  Neal
>>
>
> Thanks for the idea, Neal. The docs say that just putting it into conf/
> should do the trick, but apparently that's not the case. I tried setting
> that and it just stopped printing to the shell, and it had no effect on
> the deployed container.
>
> After struggling with this too much, I'm going to work on getting this
> running in standalone mode.
>
> Mike
>
>

Re: Spark streaming on YARN?

Posted by Mike Percy <mp...@apache.org>.
On Fri, Jan 10, 2014 at 6:35 AM, Neal Sidhwaney <ne...@gmail.com> wrote:

> Are you setting the SPARK_JAVA_OPTS environment variable? Under 0.8.1, we
> had to set that to -Dlog4j.configuration=<path to logging properties file>
> to get STDOUT for workers.  This was in standalone mode, though not with
> YARN, so YMMV.
>  Neal
>

Thanks for the idea, Neal. The docs say that just putting it into conf/
should do the trick, but apparently that's not the case. I tried setting
that and it just stopped printing to the shell, and it had no effect on the
deployed container.

After struggling with this too much, I'm going to work on getting this
running in standalone mode.

Mike

Re: Spark streaming on YARN?

Posted by Neal Sidhwaney <ne...@gmail.com>.
On Fri, Jan 10, 2014 at 6:04 AM, Mike Percy <mp...@apache.org> wrote:

>
> While it was difficult to get the logs from YARN before it deleted them
> during job cleanup, I finally did and all I got was this from stderr
> (stdout file was empty):
>
>
Are you setting the SPARK_JAVA_OPTS environment variable? Under 0.8.1, we
had to set that to -Dlog4j.configuration=<path to logging properties file>
to get STDOUT for workers. This was in standalone mode, though not with
YARN, so YMMV.

Neal

Re: Spark streaming on YARN?

Posted by Mike Percy <mp...@apache.org>.
On Fri, Jan 10, 2014 at 6:17 AM, Tom Graves <tg...@yahoo.com> wrote:

> Try yarn-client mode; it is much more like running on a standalone Spark
> cluster. I successfully got the NetworkWordCount example to work with it.
> We haven't tried making Spark Streaming run under yarn-standalone mode.
> When I tried it a month or so ago it didn't work, though.
>
> Tom
>

Thanks, Tom. I tried this and it complained that it couldn't find the
examples JAR; it seems like it was not being shipped. Weird...

Re: Spark streaming on YARN?

Posted by Tom Graves <tg...@yahoo.com>.
Try yarn-client mode; it is much more like running on a standalone Spark cluster. I successfully got the NetworkWordCount example to work with it. We haven't tried making Spark Streaming run under yarn-standalone mode. When I tried it a month or so ago it didn't work, though.

Tom



On Friday, January 10, 2014 7:24 AM, Mike Percy <mp...@apache.org> wrote:
 
OK, I got a bit farther, but still no dice. I hacked up the HdfsWordCount.scala file to add a bunch of print statements whose output goes to a file, basically. Apologies for the messy code; like I said, I'm new to Scala. The file is here:

https://gist.github.com/mpercy/8351573


Apparently it's executing and getting somewhat far into the script, so I believe I'm actually passing the args correctly. This is the output I am getting from the above in /tmp/thing.txt (great name, I know):

hello world
still here: yarn-standalone, hdfs:///user/systest/hdfswordcount-test2
still here again ok: org.apache.spark.streaming.StreamingContext@7a237188
created stream: org.apache.spark.streaming.dstream.MappedDStream@6f4b06be
printing word counts: org.apache.spark.streaming.dstream.ShuffledDStream@5442e1da

So it looks like it hangs or gets killed when executing wordCounts.print() ... but that's all the info I've been able to glean so far.

I'm not sure I am catching the Throwable properly, if there is one (I'm trying to get a stack trace). I wonder if it's getting kill -9ed by somebody... but I'd only expect that from YARN if there was an OutOfMemoryError, and if that happened I think I should catch the OOME via my try block and be able to print it, unless it threw again inside my catch, maybe.

Any suggestions on where to go from here?

Thanks,
Mike



On Fri, Jan 10, 2014 at 3:04 AM, Mike Percy <mp...@apache.org> wrote:

Thanks Tathagata. That helped a lot, but I am having some trouble under YARN with the HdfsWordCount example.
>
>
>I was able to get the example to work locally, and was also able to submit the job to the YARN cluster, but it looks like it is crashing under YARN. The streaming job stops after about 30 seconds, right after it runs, and before I'm able to put anything new into the input directory. This is the command I am running on the command line:
>
>
>export HADOOP_CONF_DIR=/etc/hadoop/conf
>SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar ./spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --class org.apache.spark.streaming.examples.HdfsWordCount --args yarn-standalone --args hdfs:///user/mpercy/hdfswordcount-test2 --num-workers 3 --master-memory 4g --worker-memory 2g --worker-cores 1
>
>
>This is the kind of output I am getting in the YARN NodeManager log file:
>
>
>2014-01-09 20:13:29,249 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to driver: akka://spark@sparktest-01:58117/user/CoarseGrainedScheduler
>2014-01-09 20:13:29,358 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 8 cluster_timestamp: 1389304540039 } attemptId: 1 } id: 4 } state: C_RUNNING diagnostics: "" exit_status: -1000
>2014-01-09 20:13:29,476 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: Driver terminated or disconnected! Shutting down.
>2014-01-09 20:13:29,825 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1389304540039_0008_01_000004 is : 1
>2014-01-09 20:13:29,825 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1389304540039_0008_01_000004 and exit code: 1
>org.apache.hadoop.util.Shell$ExitCodeException: 
>        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
>        at org.apache.hadoop.util.Shell.run(Shell.java:379)
>        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:724)
>2014-01-09 20:13:29,825 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: 
>2014-01-09 20:13:29,825 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
>2014-01-09 20:13:29,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1389304540039_0008_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
>
>
>While it was difficult to get the logs from YARN before it deleted them during job cleanup, I finally did and all I got was this from stderr (stdout file was empty):
>
>
>SLF4J: Class path contains multiple SLF4J bindings.
>SLF4J: Found binding in [jar:file:/media/ephemeral0/yarn/nm/usercache/mpercy/filecache/53/spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>SLF4J: Found binding in [jar:file:/media/ephemeral0/yarn/nm/usercache/mpercy/filecache/52/spark-examples-assembly-0.8.1-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
>
>Not super useful AFAICT, since I'm pretty sure SLF4J will pick the first binding so I doubt that was the cause of the crash. Any suggestions on how to proceed?
>
>
>One guess is that I am passing the args wrong. I'm new to Scala so I'm not sure whether I'm reading the ClientArguments code right, but based on the comments in one of the files I think passing --args multiple times is the right way to do it.
>
>
>And just for good measure, this is what is being executed by YARN's launch_container.sh script:
>
>
>exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx4096m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.streaming.examples.HdfsWordCount --jar examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --args  'yarn-standalone'  --args  'hdfs:///user/mpercy/hdfswordcount-test2'  --worker-memory 2048 --worker-cores 1 --num-workers 3 1> /var/log/hadoop-yarn/container/application_1389304540039_0052/container_1389304540039_0052_01_000001/stdout 2> /var/log/hadoop-yarn/container/application_1389304540039_0052/container_1389304540039_0052_01_000001/stderr
>
>
>
>Would love to hear any suggestions for how to debug this further!
>
>
>Thanks,
>Mike
>
>
>
>
>
>
>On Thu, Jan 9, 2014 at 5:44 PM, Tathagata Das <ta...@gmail.com> wrote:
>
>If you have been able to run Spark Pi on YARN, then you should be able to run the streaming example HdfsWordCount as well. Even though the instructions in the example say to run it on a local machine, you can run the example on YARN in the same way as Spark Pi. You would just have to give the appropriate Spark master URL and use an HDFS directory as the 2nd parameter. Then any text file written to that HDFS directory will get "word counted".
>>
>>
>>Note that you should write a file to that HDFS directory by moving the file from some other directory into it. For example, if the HDFS directory that you want to use to run the example is hdfs://myhdfs:9000/mydir/, then you can first copy a local file (say new_file) to "hdfs://myhdfs:9000/temp_location/new_file" and then move it to "hdfs://myhdfs:9000/mydir/new_file".
>>
>>
>>
>>
>>
>>
>>
>>On Thu, Jan 9, 2014 at 5:29 PM, Mike Percy <mp...@apache.org> wrote:
>>
>>After looking through the docs, grepping the commit logs and looking on the list archives, I have been unable to see an indication or example of Spark streaming working on YARN. Is this possible yet? So far, I've gotten at least the Spark Pi example to run on YARN with CDH5 beta 1.
>>>
>>>
>>>I am about to dig into the code and try to figure out how the batch Yarn client works, to see how much work it would be to set up an AM to run an InputDStream, but thought I'd make it easy on myself and ask here first before I got started.
>>>
>>>
>>>Thanks in advance for any pointers,
>>>Mike
>>>
>>>
>>>
>>
>

Re: Spark streaming on YARN?

Posted by Mike Percy <mp...@apache.org>.
OK, I got a bit farther, but still no dice. I hacked up the
HdfsWordCount.scala file to add a bunch of print statements whose output
goes to a file, basically. Apologies for the messy code; like I said, I'm
new to Scala. The file is here:

https://gist.github.com/mpercy/8351573
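
For reference, the instrumentation looks roughly like this (a heavily
simplified sketch from memory, with illustrative names; the gist has the
real code):

import java.io.FileWriter
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object HdfsWordCountDebug {
  // Append a line to a scratch file so output survives even if stdout is lost.
  def log(msg: String) {
    val out = new FileWriter("/tmp/thing.txt", true)
    out.write(msg + "\n")
    out.close()
  }

  def main(args: Array[String]) {
    try {
      log("hello world")
      log("still here: " + args.mkString(", "))
      // Context creation simplified; the real example also passes sparkHome and jars.
      val ssc = new StreamingContext(args(0), "HdfsWordCount", Seconds(2))
      log("still here again ok: " + ssc)
      val words = ssc.textFileStream(args(1)).flatMap(_.split(" ")).map(x => (x, 1))
      log("created stream: " + words)
      val wordCounts = words.reduceByKey(_ + _)
      log("printing word counts: " + wordCounts)
      wordCounts.print()   // last point reached, according to /tmp/thing.txt
      ssc.start()
    } catch {
      case t: Throwable => log("caught: " + t)
    }
  }
}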

Apparently it's executing and getting somewhat far into the script, so I
believe I'm actually passing the args correctly. This is the output I am
getting from the above in /tmp/thing.txt (great name, I know):

hello world
still here: yarn-standalone, hdfs:///user/systest/hdfswordcount-test2
still here again ok: org.apache.spark.streaming.StreamingContext@7a237188
created stream: org.apache.spark.streaming.dstream.MappedDStream@6f4b06be
printing word counts:
org.apache.spark.streaming.dstream.ShuffledDStream@5442e1da

So it looks like it hangs or gets killed when executing wordCounts.print()
... but that's all the info I've been able to glean so far.

I'm not sure I am catching the Throwable properly, if there is one (I'm
trying to get a stack trace). I wonder if it's getting kill -9ed by
somebody... but I'd only expect that from YARN if there was an
OutOfMemoryError, and if that happened I think I should catch the OOME via
my try block and be able to print it, unless it threw again inside my
catch, maybe.

Any suggestions on where to go from here?

Thanks,
Mike


On Fri, Jan 10, 2014 at 3:04 AM, Mike Percy <mp...@apache.org> wrote:

> Thanks Tathagata. That helped a lot, but I am having some trouble under
> YARN with the HdfsWordCount example.
>
> I was able to get the example to work locally, and was also able to submit
> the job to the YARN cluster, but it looks like it is crashing under YARN.
> The streaming job stops after about 30 seconds, right after it runs, and
> before I'm able to put anything new into the input directory. This is the
> command I am running on the command line:
>
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar
> ./spark-class org.apache.spark.deploy.yarn.Client --jar
> examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar
> --class org.apache.spark.streaming.examples.HdfsWordCount --args
> yarn-standalone --args hdfs:///user/mpercy/hdfswordcount-test2
> --num-workers 3 --master-memory 4g --worker-memory 2g --worker-cores 1
>
> This is the kind of output I am getting in the YARN NodeManager log file:
>
> 2014-01-09 20:13:29,249 INFO
> org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to
> driver: akka://spark@sparktest-01:58117/user/CoarseGrainedScheduler
> 2014-01-09 20:13:29,358 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 8 cluster_timestamp: 1389304540039 } attemptId: 1 } id: 4 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2014-01-09 20:13:29,476 ERROR
> org.apache.spark.executor.CoarseGrainedExecutorBackend: Driver terminated
> or disconnected! Shutting down.
> 2014-01-09 20:13:29,825 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
> code from container container_1389304540039_0008_01_000004 is : 1
> 2014-01-09 20:13:29,825 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
> Exception from container-launch with container ID:
> container_1389304540039_0008_01_000004 and exit code: 1
> org.apache.hadoop.util.Shell$ExitCodeException:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
>         at org.apache.hadoop.util.Shell.run(Shell.java:379)
>         at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>         at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> 2014-01-09 20:13:29,825 INFO
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
> 2014-01-09 20:13:29,825 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Container exited with a non-zero exit code 1
> 2014-01-09 20:13:29,826 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1389304540039_0008_01_000004 transitioned from RUNNING
> to EXITED_WITH_FAILURE
>
> While it was difficult to get the logs from YARN before it deleted them
> during job cleanup, I finally did and all I got was this from stderr
> (stdout file was empty):
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/media/ephemeral0/yarn/nm/usercache/mpercy/filecache/53/spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/media/ephemeral0/yarn/nm/usercache/mpercy/filecache/52/spark-examples-assembly-0.8.1-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
> Not super useful AFAICT, since I'm pretty sure SLF4J will pick the first
> binding so I doubt that was the cause of the crash. Any suggestions on how
> to proceed?
>
> One guess is that I am passing the args wrong. I'm new to Scala so I'm not
> sure whether I'm reading the ClientArguments code right, but based on the
> comments in one of the files I think passing --args multiple times is the
> right way to do it.
>
> And just for good measure, this is what is being executed by YARN's
> launch_container.sh script:
>
> exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx4096m
> -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster
> --class org.apache.spark.streaming.examples.HdfsWordCount --jar
> examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar
> --args  'yarn-standalone'  --args
>  'hdfs:///user/mpercy/hdfswordcount-test2'  --worker-memory 2048
> --worker-cores 1 --num-workers 3 1>
> /var/log/hadoop-yarn/container/application_1389304540039_0052/container_1389304540039_0052_01_000001/stdout
> 2>
> /var/log/hadoop-yarn/container/application_1389304540039_0052/container_1389304540039_0052_01_000001/stderr
>
> Would love to hear any suggestions for how to debug this further!
>
> Thanks,
> Mike
>
>
>
> On Thu, Jan 9, 2014 at 5:44 PM, Tathagata Das <tathagata.das1565@gmail.com
> > wrote:
>
>> If you have been able to run Spark Pi on YARN, then you should be able to
>> run the streaming example HdfsWordCount<https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/HdfsWordCount.scala>
>> as well. Even though the instructions in the example say to run it on a
>> local machine, you can run the example on YARN in the same way as Spark
>> Pi. You would just have to give the appropriate Spark master URL and use
>> an HDFS directory as the 2nd parameter. Then any text file written to
>> that HDFS directory will get "word counted".
>>
>> Note that you should write a file to that HDFS directory by moving the
>> file from some other directory into it. For example, if the HDFS
>> directory that you want to use to run the example is
>> hdfs://myhdfs:9000/mydir/, then you can first copy a local file (say
>> new_file) to "hdfs://myhdfs:9000/temp_location/new_file" and then move
>> it to "hdfs://myhdfs:9000/mydir/new_file".
>>
>>
>>
>>
>> On Thu, Jan 9, 2014 at 5:29 PM, Mike Percy <mp...@apache.org> wrote:
>>
>>> After looking through the docs, grepping the commit logs and looking on
>>> the list archives, I have been unable to see an indication or example of
>>> Spark streaming working on YARN. Is this possible yet? So far, I've gotten
>>> at least the Spark Pi example to run on YARN with CDH5 beta 1.
>>>
>>> I am about to dig into the code and try to figure out how the batch Yarn
>>> client works, to see how much work it would be to set up an AM to run an
>>> InputDStream, but thought I'd make it easy on myself and ask here first before
>>> I got started.
>>>
>>> Thanks in advance for any pointers,
>>> Mike
>>>
>>>
>>
>

Re: Spark streaming on YARN?

Posted by Mike Percy <mp...@apache.org>.
Thanks Tathagata. That helped a lot, but I am having some trouble under
YARN with the HdfsWordCount example.

I was able to get the example to work locally, and was also able to submit
the job to the YARN cluster, but it looks like it is crashing under YARN.
The streaming job stops after about 30 seconds, right after it runs, and
before I'm able to put anything new into the input directory. This is the
command I am running on the command line:

export HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar
examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar
--class org.apache.spark.streaming.examples.HdfsWordCount --args
yarn-standalone --args hdfs:///user/mpercy/hdfswordcount-test2
--num-workers 3 --master-memory 4g --worker-memory 2g --worker-cores 1

This is the kind of output I am getting in the YARN NodeManager log file:

2014-01-09 20:13:29,249 INFO
org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to
driver: akka://spark@sparktest-01:58117/user/CoarseGrainedScheduler
2014-01-09 20:13:29,358 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 8 cluster_timestamp: 1389304540039 } attemptId: 1 } id: 4 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2014-01-09 20:13:29,476 ERROR
org.apache.spark.executor.CoarseGrainedExecutorBackend: Driver terminated
or disconnected! Shutting down.
2014-01-09 20:13:29,825 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
code from container container_1389304540039_0008_01_000004 is : 1
2014-01-09 20:13:29,825 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
Exception from container-launch with container ID:
container_1389304540039_0008_01_000004 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
        at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
2014-01-09 20:13:29,825 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2014-01-09 20:13:29,825 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Container exited with a non-zero exit code 1
2014-01-09 20:13:29,826 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
Container container_1389304540039_0008_01_000004 transitioned from RUNNING
to EXITED_WITH_FAILURE

While it was difficult to get the logs from YARN before it deleted them
during job cleanup, I finally did and all I got was this from stderr
(stdout file was empty):

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/media/ephemeral0/yarn/nm/usercache/mpercy/filecache/53/spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/media/ephemeral0/yarn/nm/usercache/mpercy/filecache/52/spark-examples-assembly-0.8.1-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Not super useful AFAICT, since I'm pretty sure SLF4J will pick the first
binding so I doubt that was the cause of the crash. Any suggestions on how
to proceed?

One guess is that I am passing the args wrong. I'm new to Scala so I'm not
sure whether I'm reading the ClientArguments code right, but based on the
comments in one of the files I think passing --args multiple times is the
right way to do it.

And just for good measure, this is what is being executed by YARN's
launch_container.sh script:

exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx4096m
-Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster
--class org.apache.spark.streaming.examples.HdfsWordCount --jar
examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar
--args  'yarn-standalone'  --args
 'hdfs:///user/mpercy/hdfswordcount-test2'  --worker-memory 2048
--worker-cores 1 --num-workers 3 1>
/var/log/hadoop-yarn/container/application_1389304540039_0052/container_1389304540039_0052_01_000001/stdout
2>
/var/log/hadoop-yarn/container/application_1389304540039_0052/container_1389304540039_0052_01_000001/stderr

Would love to hear any suggestions for how to debug this further!

Thanks,
Mike



On Thu, Jan 9, 2014 at 5:44 PM, Tathagata Das
<ta...@gmail.com> wrote:

> If you have been able to run Spark Pi on YARN, then you should be able to
> run the streaming example HdfsWordCount<https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/HdfsWordCount.scala>
> as well. Even though the instructions in the example say to run it on a
> local machine, you can run the example on YARN in the same way as Spark
> Pi. You would just have to give the appropriate Spark master URL and use
> an HDFS directory as the 2nd parameter. Then any text file written to
> that HDFS directory will get "word counted".
>
> Note that you should write a file to that HDFS directory by moving the
> file from some other directory into it. For example, if the HDFS
> directory that you want to use to run the example is
> hdfs://myhdfs:9000/mydir/, then you can first copy a local file (say
> new_file) to "hdfs://myhdfs:9000/temp_location/new_file" and then move
> it to "hdfs://myhdfs:9000/mydir/new_file".
>
>
>
>
> On Thu, Jan 9, 2014 at 5:29 PM, Mike Percy <mp...@apache.org> wrote:
>
>> After looking through the docs, grepping the commit logs and looking on
>> the list archives, I have been unable to see an indication or example of
>> Spark streaming working on YARN. Is this possible yet? So far, I've gotten
>> at least the Spark Pi example to run on YARN with CDH5 beta 1.
>>
>> I am about to dig into the code and try to figure out how the batch Yarn
>> client works, to see how much work it would be to set up an AM to run an
>> InputDStream, but thought I'd make it easy on myself and ask here first before
>> I got started.
>>
>> Thanks in advance for any pointers,
>> Mike
>>
>>
>

Re: Spark streaming on YARN?

Posted by Tathagata Das <ta...@gmail.com>.
If you have been able to run Spark Pi on YARN, then you should be able to
run the streaming example
HdfsWordCount<https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/HdfsWordCount.scala>
as well. Even though the instructions in the example say to run it on a
local machine, you can run the example on YARN in the same way as Spark Pi.
You would just have to give the appropriate Spark master URL and use an
HDFS directory as the 2nd parameter. Then any text file written to that
HDFS directory will get "word counted".

Note that you should write a file to that HDFS directory by moving the file
from some other directory into it. For example, if the HDFS directory that
you want to use to run the example is hdfs://myhdfs:9000/mydir/, then you
can first copy a local file (say new_file) to
"hdfs://myhdfs:9000/temp_location/new_file" and then move it to
"hdfs://myhdfs:9000/mydir/new_file".




On Thu, Jan 9, 2014 at 5:29 PM, Mike Percy <mp...@apache.org> wrote:

> After looking through the docs, grepping the commit logs and looking on
> the list archives, I have been unable to see an indication or example of
> Spark streaming working on YARN. Is this possible yet? So far, I've gotten
> at least the Spark Pi example to run on YARN with CDH5 beta 1.
>
> I am about to dig into the code and try to figure out how the batch Yarn
> client works, to see how much work it would be to set up an AM to run an
> InputDStream, but thought I'd make it easy on myself and ask here first before
> I got started.
>
> Thanks in advance for any pointers,
> Mike
>
>