Posted to user@spark.apache.org by Matt Work Coarr <ma...@gmail.com> on 2014/07/15 23:43:49 UTC

can't get jobs to run on cluster (enough memory and cpus are available on worker)

Hello spark folks,

I have a simple spark cluster setup but I can't get jobs to run on it.  I
am using the standalone mode.

One master, one slave.  Both machines have 32GB ram and 8 cores.

The slave is setup with one worker that has 8 cores and 24GB memory
allocated.

My application requires 2 cores and 5GB of memory.
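
For concreteness, those requirements are passed in the usual way at submit
time; the sketch below uses a placeholder master URL, class, and jar rather
than my actual launch command, and the flag names are the Spark 1.0
spark-submit ones (check spark-submit --help on your version):

  spark-submit \
    --master spark://<master-host>:7077 \
    --total-executor-cores 2 \
    --executor-memory 5g \
    --class com.example.MyApp \
    my-app.jar

The equivalent configuration properties are spark.cores.max and
spark.executor.memory.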

However, I'm getting the following error:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
memory

What else should I check for?

This is a simplified setup (the real cluster has 20 nodes).  In this
simplified setup I am running the master and the slave manually.  The
master's web page shows the worker and it shows the application and the
memory/core requirements match what I mentioned above.

I also tried running the SparkPi example via bin/run-example and get the
same result.  It requires 8 cores and 512MB of memory, which is also
clearly within the limits of the available worker.
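
For reference, the example run was along these lines (the master URL is a
placeholder, and the exact bin/run-example syntax varies a little between
Spark releases):

  MASTER=spark://<master-host>:7077 ./bin/run-example SparkPi 10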

Any ideas would be greatly appreciated!!

Matt

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Matt Work Coarr <ma...@gmail.com>.
I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 10000-65535.  I'm not
sure if there's a more specific range I could have used, but so far,
everything is running!
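
For anyone hitting the same thing, the rule amounts to something like the
following AWS CLI call; the security group id and subnet CIDR below are
placeholders, not our real values:

  aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxx \
    --protocol tcp \
    --port 10000-65535 \
    --cidr 10.0.0.0/16

As for a more specific range: the driver's listening port (the one the
executors connect back to) does not have to be random. In Spark 1.0 only
spark.driver.port can be fixed (other ports, such as the driver's file
server, are still chosen at random); later releases add settings for the
remaining ports, which is what would let the rule be narrowed to a handful
of fixed ports. A sketch, with an arbitrarily chosen port number:

  # conf/spark-defaults.conf on the machine running the driver
  # executors dial back to this fixed port; 45000 is an arbitrary choice
  spark.driver.port   45000

We have only verified the wide range above, though.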

Thanks for all the responses Marcelo and Andrew!!

Matt


On Thu, Jul 17, 2014 at 9:10 PM, Andrew Or <an...@databricks.com> wrote:

> Hi Matt,
>
> The security group shouldn't be an issue; the ports listed in
> `spark_ec2.py` are only for communication with the outside world.
>
> How did you launch your application? I notice you did not launch your
> driver from your Master node. What happens if you did? Another thing is
> that there seems to be some inconsistency or missing pieces in the logs you
> posted. After an executor says "driver disassociated," what happens in the
> driver logs? Is an exception thrown or something?
>
> It would be useful if you could also post your conf/spark-env.sh.
>
> Andrew
>
>
> 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin <va...@cloudera.com>:
>
> Hi Matt,
>>
>> I'm not very familiar with setup on ec2; the closest I can point you
>> at is to look at the "launch_cluster" in ec2/spark_ec2.py, where the
>> ports seem to be configured.
>>
>>
>> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
>> <ma...@gmail.com> wrote:
>> > Thanks Marcelo!  This is a huge help!!
>> >
>> > Looking at the executor logs (in a vanilla spark install, I'm finding
>> them
>> > in $SPARK_HOME/work/*)...
>> >
>> > It launches the executor, but it looks like the
>> CoarseGrainedExecutorBackend
>> > is having trouble talking to the driver (exactly what you said!!!).
>> >
>> > Do you know what range of random ports is used for the executor-to-driver
>> > connection?  Is that range adjustable?  Any config setting or
>> > environment variable?
>> >
>> > I manually set up my ec2 security group to include all the ports that the
>> > spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
>> > groups.  They included (for those listed above 10000):
>> > 19999
>> > 50060
>> > 50070
>> > 50075
>> > 60060
>> > 60070
>> > 60075
>> >
>> > Obviously I'll need to make some adjustments to my EC2 security group!
>>  Just
>> > need to figure out exactly what should be in there.  To keep things
>> simple,
>> > I just have one security group for the master, slaves, and the driver
>> > machine.
>> >
>> > In listing the port ranges in my current security group I looked at the
>> > ports that spark_ec2.py sets up as well as the ports listed in the
>> "spark
>> > standalone mode" documentation page under "configuring ports for network
>> > security":
>> >
>> > http://spark.apache.org/docs/latest/spark-standalone.html
>> >
>> >
>> > Here are the relevant fragments from the executor log:
>> >
>> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
>> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
>> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
>> > "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
>> > "0" "ip-10-202-8-45.ec2.internal" "8"
>> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
>> > "app-20140717195146-0000"
>> >
>> > ========================================
>> >
>> > ...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the
>> custom-built
>> > native-hadoop library...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop
>> with
>> > error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader:
>> >
>> java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> >
>> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
>> > library for your platform... using builtin-java classes where applicable
>> >
>> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling
>> back
>> > to shell based
>> >
>> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
>> mapping
>> > impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
>> >
>> > 14/07/17 19:51:48 DEBUG Groups: Group mapping
>> > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
>> > cacheTimeout=300000
>> >
>> > 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
>> >
>> > ...
>> >
>> >
>> > 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
>> > akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
>> >
>> > 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
>> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>> >
>> > 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
>> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>> >
>> > 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver
>> Disassociated
>> > [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
>> > [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
>> > Shutting down.
>> >
>> >
>> > Thanks a bunch!
>> > Matt
>> >
>> >
>> > On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <va...@cloudera.com>
>> wrote:
>> >>
>> >> When I meant the executor log, I meant the log of the process launched
>> >> by the worker, not the worker. In my CDH-based Spark install, those
>> >> end up in /var/run/spark/work.
>> >>
>> >> If you look at your worker log, you'll see it's launching the executor
>> >> process. So there should be something there.
>> >>
>> >> Since you say it works when both are run in the same node, that
>> >> probably points to some communication issue, since the executor needs
>> >> to connect back to the driver. Check to see if you don't have any
>> >> firewalls blocking the ports Spark tries to use. (That's one of the
>> >> non-resource-related cases that will cause that message.)
>>
>>
>>
>> --
>> Marcelo
>>
>
>

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Andrew Or <an...@databricks.com>.
Hi Matt,

The security group shouldn't be an issue; the ports listed in
`spark_ec2.py` are only for communication with the outside world.

How did you launch your application? I notice you did not launch your
driver from your Master node. What happens if you did? Another thing is
that there seems to be some inconsistency or missing pieces in the logs you
posted. After an executor says "driver disassociated," what happens in the
driver logs? Is an exception thrown or something?

It would be useful if you could also post your conf/spark-env.sh.
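
For comparison, a minimal conf/spark-env.sh for a standalone master/worker
usually looks something like the sketch below (the values are illustrative,
not taken from your setup):

  # conf/spark-env.sh, sourced by the standalone daemons
  export SPARK_MASTER_IP=ip-10-202-9-195.ec2.internal
  # resources this worker offers to the master
  export SPARK_WORKER_CORES=8
  export SPARK_WORKER_MEMORY=24g
  # node-local scratch space for shuffle and spill files
  export SPARK_LOCAL_DIRS=/mnt/spark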

Andrew


2014-07-17 14:11 GMT-07:00 Marcelo Vanzin <va...@cloudera.com>:

> Hi Matt,
>
> I'm not very familiar with setup on ec2; the closest I can point you
> at is to look at the "launch_cluster" in ec2/spark_ec2.py, where the
> ports seem to be configured.
>
>
> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
> <ma...@gmail.com> wrote:
> > Thanks Marcelo!  This is a huge help!!
> >
> > Looking at the executor logs (in a vanilla spark install, I'm finding
> them
> > in $SPARK_HOME/work/*)...
> >
> > It launches the executor, but it looks like the
> CoarseGrainedExecutorBackend
> > is having trouble talking to the driver (exactly what you said!!!).
> >
> > Do you know what range of random ports is used for the executor-to-driver
> > connection?  Is that range adjustable?  Any config setting or
> > environment variable?
> >
> > I manually set up my ec2 security group to include all the ports that the
> > spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
> > groups.  They included (for those listed above 10000):
> > 19999
> > 50060
> > 50070
> > 50075
> > 60060
> > 60070
> > 60075
> >
> > Obviously I'll need to make some adjustments to my EC2 security group!
>  Just
> > need to figure out exactly what should be in there.  To keep things
> simple,
> > I just have one security group for the master, slaves, and the driver
> > machine.
> >
> > In listing the port ranges in my current security group I looked at the
> > ports that spark_ec2.py sets up as well as the ports listed in the "spark
> > standalone mode" documentation page under "configuring ports for network
> > security":
> >
> > http://spark.apache.org/docs/latest/spark-standalone.html
> >
> >
> > Here are the relevant fragments from the executor log:
> >
> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
> > "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
> > "0" "ip-10-202-8-45.ec2.internal" "8"
> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
> > "app-20140717195146-0000"
> >
> > ========================================
> >
> > ...
> >
> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
> > native-hadoop library...
> >
> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop
> with
> > error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> >
> > 14/07/17 19:51:47 DEBUG NativeCodeLoader:
> >
> java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> >
> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
> > library for your platform... using builtin-java classes where applicable
> >
> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling
> back
> > to shell based
> >
> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
> mapping
> > impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
> >
> > 14/07/17 19:51:48 DEBUG Groups: Group mapping
> > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
> > cacheTimeout=300000
> >
> > 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
> >
> > ...
> >
> >
> > 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> > akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
> >
> > 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
> >
> > 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
> >
> > 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver
> Disassociated
> > [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
> > [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
> > Shutting down.
> >
> >
> > Thanks a bunch!
> > Matt
> >
> >
> > On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <va...@cloudera.com>
> wrote:
> >>
> >> When I meant the executor log, I meant the log of the process launched
> >> by the worker, not the worker. In my CDH-based Spark install, those
> >> end up in /var/run/spark/work.
> >>
> >> If you look at your worker log, you'll see it's launching the executor
> >> process. So there should be something there.
> >>
> >> Since you say it works when both are run in the same node, that
> >> probably points to some communication issue, since the executor needs
> >> to connect back to the driver. Check to see if you don't have any
> >> firewalls blocking the ports Spark tries to use. (That's one of the
> >> non-resource-related cases that will cause that message.)
>
>
>
> --
> Marcelo
>

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Matt,

I'm not very familiar with setup on ec2; the closest I can point you
at is to look at the "launch_cluster" in ec2/spark_ec2.py, where the
ports seem to be configured.


On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
<ma...@gmail.com> wrote:
> Thanks Marcelo!  This is a huge help!!
>
> Looking at the executor logs (in a vanilla spark install, I'm finding them
> in $SPARK_HOME/work/*)...
>
> It launches the executor, but it looks like the CoarseGrainedExecutorBackend
> is having trouble talking to the driver (exactly what you said!!!).
>
> Do you know what range of random ports is used for the executor-to-driver
> connection?  Is that range adjustable?  Any config setting or
> environment variable?
>
> I manually set up my ec2 security group to include all the ports that the
> spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
> groups.  They included (for those listed above 10000):
> 19999
> 50060
> 50070
> 50075
> 60060
> 60070
> 60075
>
> Obviously I'll need to make some adjustments to my EC2 security group!  Just
> need to figure out exactly what should be in there.  To keep things simple,
> I just have one security group for the master, slaves, and the driver
> machine.
>
> In listing the port ranges in my current security group I looked at the
> ports that spark_ec2.py sets up as well as the ports listed in the "spark
> standalone mode" documentation page under "configuring ports for network
> security":
>
> http://spark.apache.org/docs/latest/spark-standalone.html
>
>
> Here are the relevant fragments from the executor log:
>
> Spark Executor Command: "/cask/jdk/bin/java" "-cp"
> "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
> "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
> "0" "ip-10-202-8-45.ec2.internal" "8"
> "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
> "app-20140717195146-0000"
>
> ========================================
>
> ...
>
> 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
> native-hadoop library...
>
> 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
> error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>
> 14/07/17 19:51:47 DEBUG NativeCodeLoader:
> java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>
> 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
> to shell based
>
> 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
>
> 14/07/17 19:51:48 DEBUG Groups: Group mapping
> impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
> cacheTimeout=300000
>
> 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
>
> ...
>
>
> 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
>
> 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
> akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>
> 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
> akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
>
> 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
> [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
> Shutting down.
>
>
> Thanks a bunch!
> Matt
>
>
> On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
>>
>> When I meant the executor log, I meant the log of the process launched
>> by the worker, not the worker. In my CDH-based Spark install, those
>> end up in /var/run/spark/work.
>>
>> If you look at your worker log, you'll see it's launching the executor
>> process. So there should be something there.
>>
>> Since you say it works when both are run in the same node, that
>> probably points to some communication issue, since the executor needs
>> to connect back to the driver. Check to see if you don't have any
>> firewalls blocking the ports Spark tries to use. (That's one of the
>> non-resource-related cases that will cause that message.)



-- 
Marcelo

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Matt Work Coarr <ma...@gmail.com>.
Thanks Marcelo!  This is a huge help!!

Looking at the executor logs (in a vanilla spark install, I'm finding them
in $SPARK_HOME/work/*)...

It launches the executor, but it looks like the
CoarseGrainedExecutorBackend is having trouble talking to the driver
(exactly what you said!!!).

Do you know what range of random ports is used for the executor-to-driver
connection?  Is that range adjustable?  Any config setting or
environment variable?

I manually set up my ec2 security group to include all the ports that the
spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
groups.  They included (for those listed above 10000):
19999
50060
50070
50075
60060
60070
60075

Obviously I'll need to make some adjustments to my EC2 security group!
Just need to figure out exactly what should be in there.  To keep things
simple, I just have one security group for the master, slaves, and the
driver machine.

In listing the port ranges in my current security group I looked at the
ports that spark_ec2.py sets up as well as the ports listed in the "spark
standalone mode" documentation page under "configuring ports for network
security":

http://spark.apache.org/docs/latest/spark-standalone.html


Here are the relevant fragments from the executor log:

Spark Executor Command: "/cask/jdk/bin/java" "-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
"0" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140717195146-0000"

========================================
...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path

14/07/17 19:51:47 DEBUG NativeCodeLoader:
java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
to shell based

14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping

14/07/17 19:51:48 DEBUG Groups: Group mapping
impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
cacheTimeout=300000

14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user

...


14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@ip-10-202-11-191.ec2.internal
:46787/user/CoarseGrainedScheduler

14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
[akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
Shutting down.


Thanks a bunch!
Matt


On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> When I meant the executor log, I meant the log of the process launched
> by the worker, not the worker. In my CDH-based Spark install, those
> end up in /var/run/spark/work.
>
> If you look at your worker log, you'll see it's launching the executor
> process. So there should be something there.
>
> Since you say it works when both are run in the same node, that
> probably points to some communication issue, since the executor needs
> to connect back to the driver. Check to see if you don't have any
> firewalls blocking the ports Spark tries to use. (That's one of the
> non-resource-related cases that will cause that message.)
>

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr
<ma...@gmail.com> wrote:
> Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
> what's causing this to break.
>
> One interesting point that we just discovered is that if we run the driver
> and the slave (worker) on the same host it runs, but if we run the driver on
> a separate host it does not run.

When I meant the executor log, I meant the log of the process launched
by the worker, not the worker. In my CDH-based Spark install, those
end up in /var/run/spark/work.
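
In a plain Spark download the same per-executor logs sit under the worker's
work directory; a quick way to look, assuming default paths:

  # on the worker machine
  ls $SPARK_HOME/work/
  # each app-* directory holds one numbered directory per executor,
  # containing the launch command plus its stdout and stderr
  tail -n 50 $SPARK_HOME/work/app-*/*/stderr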

If you look at your worker log, you'll see it's launching the executor
process. So there should be something there.

Since you say it works when both are run in the same node, that
probably points to some communication issue, since the executor needs
to connect back to the driver. Check to see if you don't have any
firewalls blocking the ports Spark tries to use. (That's one of the
non-resource-related cases that will cause that message.)
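
A quick way to test that from the worker side, with placeholders for the
driver's host and port (the executor log prints them in its
akka.tcp://spark@<driver-host>:<driver-port>/... URL):

  # run on the worker; a success/open message means the port is reachable
  nc -zv <driver-host> <driver-port>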

-- 
Marcelo

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Matt Work Coarr <ma...@gmail.com>.
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
what's causing this to break.

One interesting point that we just discovered is that if we run the driver
and the slave (worker) on the same host it runs, but if we run the driver
on a separate host it does not run.

Anyways, this is all I see on the worker:

14/07/16 19:32:27 INFO Worker: Asked to launch executor
app-20140716193227-0000/0 for Spark Pi

14/07/16 19:32:27 WARN CommandUtils: SPARK_JAVA_OPTS was set on the worker.
It is deprecated in Spark 1.0.

14/07/16 19:32:27 WARN CommandUtils: Set SPARK_LOCAL_DIRS for node-specific
storage locations.

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

14/07/16 19:32:27 INFO ExecutorRunner: Launch command: "/cask/jdk/bin/java"
"-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
"-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler"
"0" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140716193227-0000"


And on the driver I see this:

14/07/16 19:32:26 INFO SparkContext: Added JAR
file:/cask/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar at
http://10.202.11.191:39642/jars/spark-examples-1.0.0-hadoop2.2.0.jar with
timestamp 1405539146752

14/07/16 19:32:26 INFO AppClient$ClientActor: Connecting to master
spark://ip-10-202-9-195.ec2.internal:7077...

14/07/16 19:32:26 INFO SparkContext: Starting job: reduce at
SparkPi.scala:35

14/07/16 19:32:26 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35)
with 2 output partitions (allowLocal=false)

14/07/16 19:32:26 INFO DAGScheduler: Final stage: Stage 0(reduce at
SparkPi.scala:35)

14/07/16 19:32:26 INFO DAGScheduler: Parents of final stage: List()

14/07/16 19:32:26 INFO DAGScheduler: Missing parents: List()

14/07/16 19:32:26 DEBUG DAGScheduler: submitStage(Stage 0)

14/07/16 19:32:26 DEBUG DAGScheduler: missing: List()

14/07/16 19:32:26 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at
map at SparkPi.scala:31), which has no missing parents

14/07/16 19:32:26 DEBUG DAGScheduler: submitMissingTasks(Stage 0)

14/07/16 19:32:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage
0 (MappedRDD[1] at map at SparkPi.scala:31)

14/07/16 19:32:26 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(0,
0), ResultTask(0, 1))

14/07/16 19:32:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

14/07/16 19:32:27 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0

14/07/16 19:32:27 DEBUG TaskSetManager: Valid locality levels for TaskSet
0.0: ANY

14/07/16 19:32:27 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0,
runningTasks: 0

14/07/16 19:32:27 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20140716193227-0000

14/07/16 19:32:27 INFO AppClient$ClientActor: Executor added:
app-20140716193227-0000/0 on
worker-20140716193059-ip-10-202-8-45.ec2.internal-7101
(ip-10-202-8-45.ec2.internal:7101) with 8 cores

14/07/16 19:32:27 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140716193227-0000/0 on hostPort ip-10-202-8-45.ec2.internal:7101 with
8 cores, 512.0 MB RAM

14/07/16 19:32:27 INFO AppClient$ClientActor: Executor updated:
app-20140716193227-0000/0 is now RUNNING


If I wait long enough and see several "initial job has not accepted any
resources" messages on the driver, this shows up in the worker:

14/07/16 19:34:09 INFO Worker: Executor app-20140716193227-0000/0 finished
with state FAILED message Command exited with code 1 exitStatus 1

14/07/16 19:34:09 INFO Worker: Asked to launch executor
app-20140716193227-0000/1 for Spark Pi

14/07/16 19:34:09 WARN CommandUtils: SPARK_JAVA_OPTS was set on the worker.
It is deprecated in Spark 1.0.

14/07/16 19:34:09 WARN CommandUtils: Set SPARK_LOCAL_DIRS for node-specific
storage locations.

14/07/16 19:34:09 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.202.8.45%3A46568-2#593829151]
was not delivered. [1] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

14/07/16 19:34:10 INFO ExecutorRunner: Launch command: "/cask/jdk/bin/java"
"-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
"-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler"
"1" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140716193227-0000"


Matt


On Tue, Jul 15, 2014 at 5:47 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> Have you looked at the slave machine to see if the process has
> actually launched? If it has, have you tried peeking into its log
> file?
>
> (That error is printed whenever the executors fail to report back to
> the driver. Insufficient resources to launch the executor is the most
> common cause of that, but not the only one.)
>
> On Tue, Jul 15, 2014 at 2:43 PM, Matt Work Coarr
> <ma...@gmail.com> wrote:
> > Hello spark folks,
> >
> > I have a simple spark cluster setup but I can't get jobs to run on it.  I am
> > using the standalone mode.
> >
> > One master, one slave.  Both machines have 32GB ram and 8 cores.
> >
> > The slave is setup with one worker that has 8 cores and 24GB memory
> > allocated.
> >
> > My application requires 2 cores and 5GB of memory.
> >
> > However, I'm getting the following error:
> >
> > WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> > your cluster UI to ensure that workers are registered and have sufficient
> > memory
> >
> >
> > What else should I check for?
> >
> > This is a simplified setup (the real cluster has 20 nodes).  In this
> > simplified setup I am running the master and the slave manually.  The
> > master's web page shows the worker and it shows the application and the
> > memory/core requirements match what I mentioned above.
> >
> > I also tried running the SparkPi example via bin/run-example and get the
> > same result.  It requires 8 cores and 512MB of memory, which is also
> clearly
> > within the limits of the available worker.
> >
> > Any ideas would be greatly appreciated!!
> >
> > Matt
>
>
>
> --
> Marcelo
>

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

Posted by Marcelo Vanzin <va...@cloudera.com>.
Have you looked at the slave machine to see if the process has
actually launched? If it has, have you tried peeking into its log
file?

(That error is printed whenever the executors fail to report back to
the driver. Insufficient resources to launch the executor is the most
common cause of that, but not the only one.)

On Tue, Jul 15, 2014 at 2:43 PM, Matt Work Coarr
<ma...@gmail.com> wrote:
> Hello spark folks,
>
> I have a simple spark cluster setup but I can't get jobs to run on it.  I am
> using the standalone mode.
>
> One master, one slave.  Both machines have 32GB ram and 8 cores.
>
> The slave is setup with one worker that has 8 cores and 24GB memory
> allocated.
>
> My application requires 2 cores and 5GB of memory.
>
> However, I'm getting the following error:
>
> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> your cluster UI to ensure that workers are registered and have sufficient
> memory
>
>
> What else should I check for?
>
> This is a simplified setup (the real cluster has 20 nodes).  In this
> simplified setup I am running the master and the slave manually.  The
> master's web page shows the worker and it shows the application and the
> memory/core requirements match what I mentioned above.
>
> I also tried running the SparkPi example via bin/run-example and get the
> same result.  It requires 8 cores and 512MB of memory, which is also clearly
> within the limits of the available worker.
>
> Any ideas would be greatly appreciated!!
>
> Matt



-- 
Marcelo