Posted to user@spark.apache.org by Shannon Quinn <sq...@gatech.edu> on 2014/06/26 03:07:34 UTC

Spark standalone network configuration problems

Hi all,

I have a 2-machine Spark network I've set up: a master and worker on 
machine1, and worker on machine2. When I run 'sbin/start-all.sh', 
everything starts up as it should. I see both workers listed on the UI 
page. The logs of both workers indicate successful registration with the 
Spark master.

The problems begin when I attempt to submit a job: I get an "address 
already in use" exception that crashes the program. It says "Failed to 
bind to " and lists the exact port and address of the master.

At this point, the only items I have set in my spark-env.sh are 
SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
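
(For concreteness, a minimal sketch of what that conf/spark-env.sh looks 
like; the 192.168.1.101 address is the master IP that comes up later in 
this thread, so substitute your own:)

# conf/spark-env.sh on the master (machine1)
export SPARK_MASTER_IP=192.168.1.101   # the master's LAN address
export SPARK_MASTER_PORT=5060          # non-standard; the default is 7077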

The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the 
master to 127.0.0.1. This allows the master to successfully send out the 
jobs; however, it ends up canceling the stage after running this command 
several times:

14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: 
app-20140625210032-0000/8 on worker-20140625205623-machine2-53597 
(machine2:53597) with 8 cores
14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 
GB RAM
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: 
app-20140625210032-0000/8 is now RUNNING
14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: 
app-20140625210032-0000/8 is now FAILED (Command exited with code 1)

The "/8" started at "/1", eventually becomes "/9", and then "/10", at 
which point the program crashes. The worker on machine2 shows similar 
messages in its logs. Here are the last bunch:

14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 
finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/25 21:00:31 INFO Worker: Asked to launch executor 
app-20140625210032-0000/10 for app_name
Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp" 
"::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" 
"-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" 
"*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10" 
"machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker" 
"app-20140625210032-0000"
14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10 
finished with state FAILED message Command exited with code 1 exitStatus 1

I highlighted the part that seemed strange to me; that's the master port 
number (I set it to 5060), and yet it's referencing localhost? Is this 
the reason why machine2 apparently can't seem to give a confirmation to 
the master once the job is submitted? (The logs from the worker on the 
master node indicate that it's running just fine)
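
(One way to hunt for where that localhost could be leaking in, assuming 
the default conf/ layout of the spark-1.0.0-bin-hadoop2 distribution 
shown in the launch command above, is to grep the usual suspects on the 
master:)

# run on the master (machine1); adjust paths to your install
grep -EHn 'localhost|127\.0\.0\.1' /etc/hosts \
    /home/spark/spark-1.0.0-bin-hadoop2/conf/spark-env.sh \
    /home/spark/spark-1.0.0-bin-hadoop2/conf/spark-defaults.conf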

I appreciate any assistance you can offer!

Regards,
Shannon Quinn


Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
Sorry, master spark URL in the web UI is *spark://192.168.1.101:5060*, 
exactly as configured.

On 6/27/14, 9:07 AM, Shannon Quinn wrote:
> I put the settings as you specified in spark-env.sh for the master. 
> When I run start-all.sh, the web UI shows both the worker on the 
> master (machine1) and the slave worker (machine2) as ALIVE and ready, 
> with the master URL at spark://192.168.1.101. However, when I run 
> spark-submit, it immediately crashes with
>
> py4j.protocol.Py4JJavaError14/06/27 09:01:32 ERROR Remoting: Remoting 
> error: [Startup failed]
> akka.remote.RemoteTransportException: Startup failed
> [...]
> org.jboss.netty.channel.ChannelException: Failed to bind to 
> /192.168.1.101:5060
> [...]
> java.net.BindException: Address already in use.
> [...]
>
> This seems entirely contrary to intuition; why would Spark be unable 
> to bind to the exact IP:port set for the master?
>
> On 6/27/14, 1:54 AM, Akhil Das wrote:
>> Hi Shannon,
>>
>> How about a setting like the following? (just removed the quotes)
>>
>> export SPARK_MASTER_IP=192.168.1.101
>> export SPARK_MASTER_PORT=5060
>> #export SPARK_LOCAL_IP=127.0.0.1
>>
>> Not sure what's happening in your case, it could be that your system 
>> is not able to bind to 192.168.1.101 address. What is the spark:// 
>> master url that you are seeing there in the webUI? (It should be 
>> spark://192.168.1.101:7077 in your case).
>>
>>
>>
>> Thanks
>> Best Regards
>>
>>
>> On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu 
>> <ma...@gatech.edu>> wrote:
>>
>>     In the interest of completeness, this is how I invoke spark:
>>
>>     [on master]
>>
>>     > sbin/start-all.sh
>>     > spark-submit --py-files extra.py main.py
>>
>>     iPhone'd
>>
>>     On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu
>>     <ma...@gatech.edu>> wrote:
>>
>>>     My *best guess* (please correct me if I'm wrong) is that the
>>>     master (machine1) is sending the command to the worker
>>>     (machine2) with the localhost argument as-is; that is, machine2
>>>     isn't doing any weird address conversion on its end.
>>>
>>>     Consequently, I've been focusing on the settings of the
>>>     master/machine1. But I haven't found anything to indicate where
>>>     the localhost argument could be coming from. /etc/hosts lists
>>>     only 127.0.0.1 as localhost; spark-defaults.conf lists
>>>     spark.master as the full IP address (not 127.0.0.1);
>>>     spark-env.sh on the master also lists the full IP under
>>>     SPARK_MASTER_IP. The *only* place on the master where it's
>>>     associated with localhost is SPARK_LOCAL_IP.
>>>
>>>     In looking at the logs of the worker spawned on master, it's
>>>     also receiving a "spark://localhost:5060" argument, but since it
>>>     resides on the master that works fine. Is it possible that the
>>>     master is, for some reason, passing
>>>     "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>>>
>>>     That was my motivation behind commenting out SPARK_LOCAL_IP;
>>>     however, that's when the master crashes immediately due to the
>>>     address already being in use.
>>>
>>>     Any ideas? Thanks!
>>>
>>>     Shannon
>>>
>>>     On 6/26/14, 10:14 AM, Akhil Das wrote:
>>>>     Can you paste your spark-env.sh file?
>>>>
>>>>     Thanks
>>>>     Best Regards
>>>>
>>>>
>>>>     On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
>>>>     <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>
>>>>         Both /etc/hosts have each other's IP addresses in them.
>>>>         Telneting from machine2 to machine1 on port 5060 works just
>>>>         fine.
>>>>
>>>>         Here's the output of lsof:
>>>>
>>>>         user@machine1:~/spark/spark-1.0.0-bin-hadoop2$
>>>>         lsof -i:5060
>>>>         COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>>>>         java    23985 user   30u  IPv6 11092354    0t0  TCP
>>>>         machine1:sip (LISTEN)
>>>>         java    23985 user   40u  IPv6 11099560    0t0  TCP
>>>>         machine1:sip->machine1:48315 (ESTABLISHED)
>>>>         java    23985 user   52u  IPv6 11100405    0t0  TCP
>>>>         machine1:sip->machine2:54476 (ESTABLISHED)
>>>>         java    24157 user   40u  IPv6 11092413    0t0  TCP
>>>>         machine1:48315->machine1:sip (ESTABLISHED)
>>>>
>>>>         Ubuntu seems to recognize 5060 as the standard port for
>>>>         "sip"; it's not actually running anything there besides
>>>>         Spark, it just does a s/5060/sip/g.
>>>>
>>>>         Is there something to the fact that every time I comment
>>>>         out SPARK_LOCAL_IP in spark-env, it crashes immediately
>>>>         upon spark-submit due to the "address already being in
>>>>         use"? Or am I barking up the wrong tree on that one?
>>>>
>>>>         Thanks again for all your help; I hope we can knock this
>>>>         one out.
>>>>
>>>>         Shannon
>>>>
>>>>
>>>>         On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>>>         Do you have <ip>         machine1 in your workers
>>>>>         /etc/hosts also? If so try telneting from your machine2 to
>>>>>         machine1 on port 5060. Also make sure nothing else is
>>>>>         running on port 5060 other than Spark (*/lsof -i:5060/*)
>>>>>
>>>>>         Thanks
>>>>>         Best Regards
>>>>>
>>>>>
>>>>>         On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
>>>>>         <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>>
>>>>>             Still running into the same problem. /etc/hosts on the
>>>>>             master says
>>>>>
>>>>>             127.0.0.1    localhost
>>>>>             <ip> machine1
>>>>>
>>>>>             <ip> is the same address set in spark-env.sh for
>>>>>             SPARK_MASTER_IP. Any other ideas?
>>>>>
>>>>>
>>>>>             On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>>             Hi Shannon,
>>>>>>
>>>>>>             It should be a configuration issue, check in your
>>>>>>             /etc/hosts and make sure localhost is not associated
>>>>>>             with the SPARK_MASTER_IP you provided.
>>>>>>
>>>>>>             Thanks
>>>>>>             Best Regards
>>>>>>
>>>>>>
>>>>>>             On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>>>>             <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>>>
>>>>>>                 Hi all,
>>>>>>
>>>>>>                 I have a 2-machine Spark network I've set up: a
>>>>>>                 master and worker on machine1, and worker on
>>>>>>                 machine2. When I run 'sbin/start-all.sh',
>>>>>>                 everything starts up as it should. I see both
>>>>>>                 workers listed on the UI page. The logs of both
>>>>>>                 workers indicate successful registration with the
>>>>>>                 Spark master.
>>>>>>
>>>>>>                 The problems begin when I attempt to submit a
>>>>>>                 job: I get an "address already in use" exception
>>>>>>                 that crashes the program. It says "Failed to bind
>>>>>>                 to " and lists the exact port and address of the
>>>>>>                 master.
>>>>>>
>>>>>>                 At this point, the only items I have set in my
>>>>>>                 spark-env.sh are SPARK_MASTER_IP and
>>>>>>                 SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>>>
>>>>>>                 The next step I took, then, was to explicitly set
>>>>>>                 SPARK_LOCAL_IP on the master to 127.0.0.1. This
>>>>>>                 allows the master to successfully send out the
>>>>>>                 jobs; however, it ends up canceling the stage
>>>>>>                 after running this command several times:
>>>>>>
>>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>>                 Executor added: app-20140625210032-0000/8 on
>>>>>>                 worker-20140625205623-machine2-53597
>>>>>>                 (machine2:53597) with 8 cores
>>>>>>                 14/06/25 21:00:47 INFO
>>>>>>                 SparkDeploySchedulerBackend: Granted executor ID
>>>>>>                 app-20140625210032-0000/8 on hostPort
>>>>>>                 machine2:53597 with 8 cores, 8.0 GB RAM
>>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>>                 Executor updated: app-20140625210032-0000/8 is
>>>>>>                 now RUNNING
>>>>>>                 14/06/25 21:00:49 INFO AppClient$ClientActor:
>>>>>>                 Executor updated: app-20140625210032-0000/8 is
>>>>>>                 now FAILED (Command exited with code 1)
>>>>>>
>>>>>>                 The "/8" started at "/1", eventually becomes
>>>>>>                 "/9", and then "/10", at which point the program
>>>>>>                 crashes. The worker on machine2 shows similar
>>>>>>                 messages in its logs. Here are the last bunch:
>>>>>>
>>>>>>                 14/06/25 21:00:31 INFO Worker: Executor
>>>>>>                 app-20140625210032-0000/9 finished with state
>>>>>>                 FAILED message Command exited with code 1
>>>>>>                 exitStatus 1
>>>>>>                 14/06/25 21:00:31 INFO Worker: Asked to launch
>>>>>>                 executor app-20140625210032-0000/10 for app_name
>>>>>>                 Spark assembly has been built with Hive,
>>>>>>                 including Datanucleus jars on classpath
>>>>>>                 14/06/25 21:00:32 INFO ExecutorRunner: Launch
>>>>>>                 command: "java" "-cp"
>>>>>>                 "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>>>>                 "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>>>>                 "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>>>>                 "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>>>>                 "10" "machine2" "8"
>>>>>>                 "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>>>>                 "app-20140625210032-0000"
>>>>>>                 14/06/25 21:00:33 INFO Worker: Executor
>>>>>>                 app-20140625210032-0000/10 finished with state
>>>>>>                 FAILED message Command exited with code 1
>>>>>>                 exitStatus 1
>>>>>>
>>>>>>                 I highlighted the part that seemed strange to me;
>>>>>>                 that's the master port number (I set it to 5060),
>>>>>>                 and yet it's referencing localhost? Is this the
>>>>>>                 reason why machine2 apparently can't seem to give
>>>>>>                 a confirmation to the master once the job is
>>>>>>                 submitted? (The logs from the worker on the
>>>>>>                 master node indicate that it's running just fine)
>>>>>>
>>>>>>                 I appreciate any assistance you can offer!
>>>>>>
>>>>>>                 Regards,
>>>>>>                 Shannon Quinn
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
I put the settings as you specified in spark-env.sh for the master. When 
I run start-all.sh, the web UI shows both the worker on the master 
(machine1) and the slave worker (machine2) as ALIVE and ready, with the 
master URL at spark://192.168.1.101. However, when I run spark-submit, 
it immediately crashes with

py4j.protocol.Py4JJavaError14/06/27 09:01:32 ERROR Remoting: Remoting 
error: [Startup failed]
akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to 
/192.168.1.101:5060
[...]
java.net.BindException: Address already in use.
[...]

This seems entirely contrary to intuition; why would Spark be unable to 
bind to the exact IP:port set for the master?
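
(One way to see what is already holding that port at submit time -- 
standard tools, nothing Spark-specific -- is:)

# on the master, while the standalone master is running
sudo lsof -i :5060              # which PID is bound to 5060
sudo netstat -tlnp | grep 5060  # same information via netstat
jps -l                          # map that PID to Master / Worker / SparkSubmit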

On 6/27/14, 1:54 AM, Akhil Das wrote:
> Hi Shannon,
>
> How about a setting like the following? (just removed the quotes)
>
> export SPARK_MASTER_IP=192.168.1.101
> export SPARK_MASTER_PORT=5060
> #export SPARK_LOCAL_IP=127.0.0.1
>
> Not sure what's happening in your case, it could be that your system is 
> not able to bind to 192.168.1.101 address. What is the spark:// master 
> url that you are seeing there in the webUI? (It should be 
> spark://192.168.1.101:7077 in your case).
>
>
>
> Thanks
> Best Regards
>
>
> On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     In the interest of completeness, this is how I invoke spark:
>
>     [on master]
>
>     > sbin/start-all.sh
>     > spark-submit --py-files extra.py main.py
>
>     iPhone'd
>
>     On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu
>     <ma...@gatech.edu>> wrote:
>
>>     My *best guess* (please correct me if I'm wrong) is that the
>>     master (machine1) is sending the command to the worker (machine2)
>>     with the localhost argument as-is; that is, machine2 isn't doing
>>     any weird address conversion on its end.
>>
>>     Consequently, I've been focusing on the settings of the
>>     master/machine1. But I haven't found anything to indicate where
>>     the localhost argument could be coming from. /etc/hosts lists
>>     only 127.0.0.1 as localhost; spark-defaults.conf lists
>>     spark.master as the full IP address (not 127.0.0.1); spark-env.sh
>>     on the master also lists the full IP under SPARK_MASTER_IP. The
>>     *only* place on the master where it's associated with localhost
>>     is SPARK_LOCAL_IP.
>>
>>     In looking at the logs of the worker spawned on master, it's also
>>     receiving a "spark://localhost:5060" argument, but since it
>>     resides on the master that works fine. Is it possible that the
>>     master is, for some reason, passing
>>     "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>>
>>     That was my motivation behind commenting out SPARK_LOCAL_IP;
>>     however, that's when the master crashes immediately due to the
>>     address already being in use.
>>
>>     Any ideas? Thanks!
>>
>>     Shannon
>>
>>     On 6/26/14, 10:14 AM, Akhil Das wrote:
>>>     Can you paste your spark-env.sh file?
>>>
>>>     Thanks
>>>     Best Regards
>>>
>>>
>>>     On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
>>>     <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>
>>>         Both /etc/hosts have each other's IP addresses in them.
>>>         Telneting from machine2 to machine1 on port 5060 works just
>>>         fine.
>>>
>>>         Here's the output of lsof:
>>>
>>>         user@machine1:~/spark/spark-1.0.0-bin-hadoop2$
>>>         lsof -i:5060
>>>         COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>>>         java    23985 user   30u  IPv6 11092354  0t0  TCP
>>>         machine1:sip (LISTEN)
>>>         java    23985 user   40u  IPv6 11099560  0t0  TCP
>>>         machine1:sip->machine1:48315 (ESTABLISHED)
>>>         java    23985 user   52u  IPv6 11100405  0t0  TCP
>>>         machine1:sip->machine2:54476 (ESTABLISHED)
>>>         java    24157 user   40u  IPv6 11092413  0t0  TCP
>>>         machine1:48315->machine1:sip (ESTABLISHED)
>>>
>>>         Ubuntu seems to recognize 5060 as the standard port for
>>>         "sip"; it's not actually running anything there besides
>>>         Spark, it just does a s/5060/sip/g.
>>>
>>>         Is there something to the fact that every time I comment out
>>>         SPARK_LOCAL_IP in spark-env, it crashes immediately upon
>>>         spark-submit due to the "address already being in use"? Or
>>>         am I barking up the wrong tree on that one?
>>>
>>>         Thanks again for all your help; I hope we can knock this one
>>>         out.
>>>
>>>         Shannon
>>>
>>>
>>>         On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>>         Do you have <ip>         machine1 in your workers
>>>>         /etc/hosts also? If so try telneting from your machine2 to
>>>>         machine1 on port 5060. Also make sure nothing else is
>>>>         running on port 5060 other than Spark (*/lsof -i:5060/*)
>>>>
>>>>         Thanks
>>>>         Best Regards
>>>>
>>>>
>>>>         On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
>>>>         <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>
>>>>             Still running into the same problem. /etc/hosts on the
>>>>             master says
>>>>
>>>>             127.0.0.1    localhost
>>>>             <ip> machine1
>>>>
>>>>             <ip> is the same address set in spark-env.sh for
>>>>             SPARK_MASTER_IP. Any other ideas?
>>>>
>>>>
>>>>             On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>             Hi Shannon,
>>>>>
>>>>>             It should be a configuration issue, check in your
>>>>>             /etc/hosts and make sure localhost is not associated
>>>>>             with the SPARK_MASTER_IP you provided.
>>>>>
>>>>>             Thanks
>>>>>             Best Regards
>>>>>
>>>>>
>>>>>             On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>>>             <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>>
>>>>>                 Hi all,
>>>>>
>>>>>                 I have a 2-machine Spark network I've set up: a
>>>>>                 master and worker on machine1, and worker on
>>>>>                 machine2. When I run 'sbin/start-all.sh',
>>>>>                 everything starts up as it should. I see both
>>>>>                 workers listed on the UI page. The logs of both
>>>>>                 workers indicate successful registration with the
>>>>>                 Spark master.
>>>>>
>>>>>                 The problems begin when I attempt to submit a job:
>>>>>                 I get an "address already in use" exception that
>>>>>                 crashes the program. It says "Failed to bind to "
>>>>>                 and lists the exact port and address of the master.
>>>>>
>>>>>                 At this point, the only items I have set in my
>>>>>                 spark-env.sh are SPARK_MASTER_IP and
>>>>>                 SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>>
>>>>>                 The next step I took, then, was to explicitly set
>>>>>                 SPARK_LOCAL_IP on the master to 127.0.0.1. This
>>>>>                 allows the master to successfully send out the
>>>>>                 jobs; however, it ends up canceling the stage
>>>>>                 after running this command several times:
>>>>>
>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>                 Executor added: app-20140625210032-0000/8 on
>>>>>                 worker-20140625205623-machine2-53597
>>>>>                 (machine2:53597) with 8 cores
>>>>>                 14/06/25 21:00:47 INFO
>>>>>                 SparkDeploySchedulerBackend: Granted executor ID
>>>>>                 app-20140625210032-0000/8 on hostPort
>>>>>                 machine2:53597 with 8 cores, 8.0 GB RAM
>>>>>                 14/06/25 21:00:47 INFO AppClient$ClientActor:
>>>>>                 Executor updated: app-20140625210032-0000/8 is now
>>>>>                 RUNNING
>>>>>                 14/06/25 21:00:49 INFO AppClient$ClientActor:
>>>>>                 Executor updated: app-20140625210032-0000/8 is now
>>>>>                 FAILED (Command exited with code 1)
>>>>>
>>>>>                 The "/8" started at "/1", eventually becomes "/9",
>>>>>                 and then "/10", at which point the program
>>>>>                 crashes. The worker on machine2 shows similar
>>>>>                 messages in its logs. Here are the last bunch:
>>>>>
>>>>>                 14/06/25 21:00:31 INFO Worker: Executor
>>>>>                 app-20140625210032-0000/9 finished with state
>>>>>                 FAILED message Command exited with code 1 exitStatus 1
>>>>>                 14/06/25 21:00:31 INFO Worker: Asked to launch
>>>>>                 executor app-20140625210032-0000/10 for app_name
>>>>>                 Spark assembly has been built with Hive, including
>>>>>                 Datanucleus jars on classpath
>>>>>                 14/06/25 21:00:32 INFO ExecutorRunner: Launch
>>>>>                 command: "java" "-cp"
>>>>>                 "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>>>                 "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>>>                 "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>>>                 "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>>>                 "10" "machine2" "8"
>>>>>                 "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
>>>>>                 14/06/25 21:00:33 INFO Worker: Executor
>>>>>                 app-20140625210032-0000/10 finished with state
>>>>>                 FAILED message Command exited with code 1 exitStatus 1
>>>>>
>>>>>                 I highlighted the part that seemed strange to me;
>>>>>                 that's the master port number (I set it to 5060),
>>>>>                 and yet it's referencing localhost? Is this the
>>>>>                 reason why machine2 apparently can't seem to give
>>>>>                 a confirmation to the master once the job is
>>>>>                 submitted? (The logs from the worker on the master
>>>>>                 node indicate that it's running just fine)
>>>>>
>>>>>                 I appreciate any assistance you can offer!
>>>>>
>>>>>                 Regards,
>>>>>                 Shannon Quinn
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>


Re: Spark standalone network configuration problems

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Shannon,

How about a setting like the following? (just removed the quotes)

export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1

Not sure what's happening in your case; it could be that your system is not
able to bind to the 192.168.1.101 address. What is the spark:// master url that
you are seeing there in the webUI? (It should be spark://192.168.1.101:7077
in your case).



Thanks
Best Regards


On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> In the interest of completeness, this is how I invoke spark:
>
> [on master]
>
> > sbin/start-all.sh
> > spark-submit --py-files extra.py main.py
>
> iPhone'd
>
> On Jun 26, 2014, at 17:29, Shannon Quinn <sq...@gatech.edu> wrote:
>
> My *best guess* (please correct me if I'm wrong) is that the master
> (machine1) is sending the command to the worker (machine2) with the
> localhost argument as-is; that is, machine2 isn't doing any weird address
> conversion on its end.
>
> Consequently, I've been focusing on the settings of the master/machine1.
> But I haven't found anything to indicate where the localhost argument could
> be coming from. /etc/hosts lists only 127.0.0.1 as localhost;
> spark-defaults.conf lists spark.master as the full IP address (not
> 127.0.0.1); spark-env.sh on the master also lists the full IP under
> SPARK_MASTER_IP. The *only* place on the master where it's associated with
> localhost is SPARK_LOCAL_IP.
>
> In looking at the logs of the worker spawned on master, it's also
> receiving a "spark://localhost:5060" argument, but since it resides on the
> master that works fine. Is it possible that the master is, for some reason,
> passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>
> That was my motivation behind commenting out SPARK_LOCAL_IP; however,
> that's when the master crashes immediately due to the address already being
> in use.
>
> Any ideas? Thanks!
>
> Shannon
>
> On 6/26/14, 10:14 AM, Akhil Das wrote:
>
>  Can you paste your spark-env.sh file?
>
>  Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>
>>  Both /etc/hosts have each other's IP addresses in them. Telneting from
>> machine2 to machine1 on port 5060 works just fine.
>>
>> Here's the output of lsof:
>>
>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>> COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>> java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip
>> (LISTEN)
>> java    23985 user   40u  IPv6 11099560      0t0  TCP
>> machine1:sip->machine1:48315 (ESTABLISHED)
>> java    23985 user   52u  IPv6 11100405      0t0  TCP
>> machine1:sip->machine2:54476 (ESTABLISHED)
>> java    24157 user   40u  IPv6 11092413      0t0  TCP
>> machine1:48315->machine1:sip (ESTABLISHED)
>>
>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
>> actually running anything there besides Spark, it just does a s/5060/sip/g.
>>
>> Is there something to the fact that every time I comment out
>> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due
>> to the "address already being in use"? Or am I barking up the wrong tree on
>> that one?
>>
>> Thanks again for all your help; I hope we can knock this one out.
>>
>> Shannon
>>
>>
>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>
>>  Do you have <ip>            machine1 in your workers /etc/hosts also?
>> If so try telneting from your machine2 to machine1 on port 5060. Also make
>> sure nothing else is running on port 5060 other than Spark (*lsof
>> -i:5060*)
>>
>>  Thanks
>> Best Regards
>>
>>
>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>
>>>  Still running into the same problem. /etc/hosts on the master says
>>>
>>> 127.0.0.1    localhost
>>> <ip>            machine1
>>>
>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>>> other ideas?
>>>
>>>
>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>
>>>  Hi Shannon,
>>>
>>>  It should be a configuration issue, check in your /etc/hosts and make
>>> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>>
>>>  Thanks
>>> Best Regards
>>>
>>>
>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu>
>>> wrote:
>>>
>>>>  Hi all,
>>>>
>>>> I have a 2-machine Spark network I've set up: a master and worker on
>>>> machine1, and worker on machine2. When I run 'sbin/start-all.sh',
>>>> everything starts up as it should. I see both workers listed on the UI
>>>> page. The logs of both workers indicate successful registration with the
>>>> Spark master.
>>>>
>>>> The problems begin when I attempt to submit a job: I get an "address
>>>> already in use" exception that crashes the program. It says "Failed to bind
>>>> to " and lists the exact port and address of the master.
>>>>
>>>> At this point, the only items I have set in my spark-env.sh are
>>>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>
>>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
>>>> master to 127.0.0.1. This allows the master to successfully send out the
>>>> jobs; however, it ends up canceling the stage after running this command
>>>> several times:
>>>>
>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>>>> (machine2:53597) with 8 cores
>>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
>>>> app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB
>>>> RAM
>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>>>> app-20140625210032-0000/8 is now RUNNING
>>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>>>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>>
>>>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
>>>> which point the program crashes. The worker on machine2 shows similar
>>>> messages in its logs. Here are the last bunch:
>>>>
>>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>> app-20140625210032-0000/10 for app_name
>>>> Spark assembly has been built with Hive, including Datanucleus jars on
>>>> classpath
>>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>>>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "
>>>> *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
>>>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>> "app-20140625210032-0000"
>>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>>>
>>>> I highlighted the part that seemed strange to me; that's the master
>>>> port number (I set it to 5060), and yet it's referencing localhost? Is this
>>>> the reason why machine2 apparently can't seem to give a confirmation to the
>>>> master once the job is submitted? (The logs from the worker on the master
>>>> node indicate that it's running just fine)
>>>>
>>>> I appreciate any assistance you can offer!
>>>>
>>>> Regards,
>>>> Shannon Quinn
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
No joy, unfortunately. Same issue; see my previous email--still crashes 
with "address already in use."

On 6/27/14, 1:54 AM, sujeetv wrote:
> Try to explicitly set the "spark.driver.host" property to the master's
> IP.
> Sujeet
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
Apologies; can you advise as to how I would check that? I can certainly 
SSH from master to machine2.
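
(A minimal way to check that from the master with standard tools -- the 
address and port are the ones from the "Connection refused" error quoted 
below; executor ports are ephemeral, so this proves the path rather than 
that exact port:)

# on the master (machine1)
ip route get 130.49.226.148    # which interface/route would be used
ping -c 3 130.49.226.148       # basic reachability
nc -zv 130.49.226.148 60949    # TCP connect test to the failing executor port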

On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
> Looks like your driver is not able to connect to the remote executor 
> on machine2/130.49.226.148:60949 <http://130.49.226.148:60949/>.  Can 
> you check if the master machine can route to 130.49.226.148?
>
> Sujeet
>
>
> On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     For some reason, commenting out spark.driver.host and
>     spark.driver.port fixed something...and broke something else (or
>     at least revealed another problem). For reference, the only lines
>     I have in my spark-defaults.conf now:
>
>     spark.app.name          myProg
>     spark.master            spark://192.168.1.101:5060
>     spark.executor.memory   8g
>     spark.files.overwrite   true
>
>     It starts up, but has problems with machine2. For some reason,
>     machine2 is having trouble communicating with *itself*. Here are
>     the worker logs of one of the failures (there are 10 before it
>     quits):
>
>
>     Spark assembly has been built with Hive, including Datanucleus
>     jars on classpath
>     14/06/27 14:55:13 INFO ExecutorRunner: Launch command: "java"
>     "-cp"
>     "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>     "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>     "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>     "akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler" "7"
>     "machine2" "8" "akka.tcp://sparkWorker@machine2:48019/user/Worker"
>     "app-20140627144512-0001"
>     14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7
>     finished with state FAILED message Command exited with code 1
>     exitStatus 1
>     14/06/27 14:56:54 INFO LocalActorRef: Message
>     [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>     from Actor[akka://sparkWorker/deadLetters] to
>     Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003]
>     was not delivered. [10] dead letters encountered. This logging can
>     be turned off or adjusted with configuration settings
>     'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>     14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
>     [akka.tcp://sparkWorker@machine2:48019] ->
>     [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
>     failed with [akka.tcp://sparkExecutor@machine2:60949]] [
>     akka.remote.EndpointAssociationException: Association failed with
>     [akka.tcp://sparkExecutor@machine2:60949]
>     Caused by:
>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>     Connection refused: machine2/130.49.226.148:60949
>     <http://130.49.226.148:60949>
>     ]
>     14/06/27 14:56:54 INFO Worker: Asked to launch executor
>     app-20140627144512-0001/8 for Funtown, USA
>     14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
>     [akka.tcp://sparkWorker@machine2:48019] ->
>     [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
>     failed with [akka.tcp://sparkExecutor@machine2:60949]] [
>     akka.remote.EndpointAssociationException: Association failed with
>     [akka.tcp://sparkExecutor@machine2:60949]
>     Caused by:
>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>     Connection refused: machine2/130.49.226.148:60949
>     <http://130.49.226.148:60949>
>     ]
>     14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
>     [akka.tcp://sparkWorker@machine2:48019] ->
>     [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
>     failed with [akka.tcp://sparkExecutor@machine2:60949]] [
>     akka.remote.EndpointAssociationException: Association failed with
>     [akka.tcp://sparkExecutor@machine2:60949]
>     Caused by:
>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>     Connection refused: machine2/130.49.226.148:60949
>     <http://130.49.226.148:60949>
>     ]
>
>     Port 48019 on machine2 is indeed open, connected, and listening.
>     Any ideas?
>
>     Thanks!
>
>     Shannon
>
>     On 6/27/14, 1:54 AM, sujeetv wrote:
>
>         Try to explicitly set the "spark.driver.host" property to
>         the master's
>         IP.
>         Sujeet
>
>
>
>         --
>         View this message in context:
>         http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
>         Sent from the Apache Spark User List mailing list archive at
>         Nabble.com.
>
>
>


Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
I switched which machine was the master and which was the dedicated 
worker, and now it works just fine. I discovered machine2 is on my 
department's DMZ; machine1 is not. I suspect the departmental firewall 
was causing problems. By moving the master to machine2, that seems to 
have solved my problems.
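
(For anyone hitting the same thing: a quick, Spark-agnostic way to test 
the firewall hypothesis is to listen on one of the failing high ports on 
the DMZ machine and connect to it from the other box; exact netcat flags 
depend on the variant installed.)

# on machine2 (the DMZ machine)
nc -l 60949            # traditional netcat needs: nc -l -p 60949

# on machine1
nc -zv machine2 60949  # refused/timeout while the listener is up points at filtering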

Thank you all very much for your help. I'm sure I'll have other 
questions soon :)

Regards,
Shannon

On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
> Looks like your driver is not able to connect to the remote executor 
> on machine2/130.49.226.148:60949 <http://130.49.226.148:60949/>.  Can 
> you check if the master machine can route to 130.49.226.148?
>
> Sujeet
>
>
> On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     For some reason, commenting out spark.driver.host and
>     spark.driver.port fixed something...and broke something else (or
>     at least revealed another problem). For reference, the only lines
>     I have in my spark-defaults.conf now:
>
>     spark.app.name          myProg
>     spark.master            spark://192.168.1.101:5060
>     spark.executor.memory   8g
>     spark.files.overwrite   true
>
>     It starts up, but has problems with machine2. For some reason,
>     machine2 is having trouble communicating with *itself*. Here are
>     the worker logs of one of the failures (there are 10 before it
>     quits):
>
>
>     Spark assembly has been built with Hive, including Datanucleus
>     jars on classpath
>     14/06/27 14:55:13 INFO ExecutorRunner: Launch command: "java"
>     "-cp"
>     "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>     "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>     "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>     "akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler" "7"
>     "machine2" "8" "akka.tcp://sparkWorker@machine2:48019/user/Worker"
>     "app-20140627144512-0001"
>     14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7
>     finished with state FAILED message Command exited with code 1
>     exitStatus 1
>     14/06/27 14:56:54 INFO LocalActorRef: Message
>     [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>     from Actor[akka://sparkWorker/deadLetters] to
>     Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003]
>     was not delivered. [10] dead letters encountered. This logging can
>     be turned off or adjusted with configuration settings
>     'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>     14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
>     [akka.tcp://sparkWorker@machine2:48019] ->
>     [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
>     failed with [akka.tcp://sparkExecutor@machine2:60949]] [
>     akka.remote.EndpointAssociationException: Association failed with
>     [akka.tcp://sparkExecutor@machine2:60949]
>     Caused by:
>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>     Connection refused: machine2/130.49.226.148:60949
>     <http://130.49.226.148:60949>
>     ]
>     14/06/27 14:56:54 INFO Worker: Asked to launch executor
>     app-20140627144512-0001/8 for Funtown, USA
>     14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
>     [akka.tcp://sparkWorker@machine2:48019] ->
>     [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
>     failed with [akka.tcp://sparkExecutor@machine2:60949]] [
>     akka.remote.EndpointAssociationException: Association failed with
>     [akka.tcp://sparkExecutor@machine2:60949]
>     Caused by:
>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>     Connection refused: machine2/130.49.226.148:60949
>     <http://130.49.226.148:60949>
>     ]
>     14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
>     [akka.tcp://sparkWorker@machine2:48019] ->
>     [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
>     failed with [akka.tcp://sparkExecutor@machine2:60949]] [
>     akka.remote.EndpointAssociationException: Association failed with
>     [akka.tcp://sparkExecutor@machine2:60949]
>     Caused by:
>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>     Connection refused: machine2/130.49.226.148:60949
>     <http://130.49.226.148:60949>
>     ]
>
>     Port 48019 on machine2 is indeed open, connected, and listening.
>     Any ideas?
>
>     Thanks!
>
>     Shannon
>
>     On 6/27/14, 1:54 AM, sujeetv wrote:
>
>         Try to explicitly set the "spark.driver.host" property to
>         the master's
>         IP.
>         Sujeet
>
>
>
>         --
>         View this message in context:
>         http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
>         Sent from the Apache Spark User List mailing list archive at
>         Nabble.com.
>
>
>


Re: Spark standalone network configuration problems

Posted by Sujeet Varakhedi <sv...@gopivotal.com>.
Looks like your driver is not able to connect to the remote executor on
machine2/130.49.226.148:60949.  Can you check if the master machine can
route to 130.49.226.148?

Sujeet


On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> For some reason, commenting out spark.driver.host and spark.driver.port
> fixed something...and broke something else (or at least revealed another
> problem). For reference, the only lines I have in my spark-defaults.conf
> now:
>
> spark.app.name          myProg
> spark.master            spark://192.168.1.101:5060
> spark.executor.memory   8g
> spark.files.overwrite   true
>
> It starts up, but has problems with machine2. For some reason, machine2 is
> having trouble communicating with *itself*. Here are the worker logs of one
> of the failures (there are 10 before it quits):
>
>
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> 14/06/27 14:55:13 INFO ExecutorRunner: Launch command: "java" "-cp"
> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/
> spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.
> 2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-
> rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/
> datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-
> hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" "-XX:MaxPermSize=128m"
> "-Xms8192M" "-Xmx8192M" "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler" "7"
> "machine2" "8" "akka.tcp://sparkWorker@machine2:48019/user/Worker"
> "app-20140627144512-0001"
> 14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/06/27 14:56:54 INFO LocalActorRef: Message [akka.remote.transport.
> ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/
> system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> 2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003] was not delivered.
> [10] dead letters encountered. This logging can be turned off or adjusted
> with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@machine2:48019] -> [akka.tcp://sparkExecutor@machine2:60949]:
> Error [Association failed with [akka.tcp://sparkExecutor@machine2:60949]]
> [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@machine2:60949]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: machine2/130.49.226.148:60949
> ]
> 14/06/27 14:56:54 INFO Worker: Asked to launch executor
> app-20140627144512-0001/8 for Funtown, USA
> 14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@machine2:48019] -> [akka.tcp://sparkExecutor@machine2:60949]:
> Error [Association failed with [akka.tcp://sparkExecutor@machine2:60949]]
> [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@machine2:60949]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: machine2/130.49.226.148:60949
> ]
> 14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@machine2:48019] -> [akka.tcp://sparkExecutor@machine2:60949]:
> Error [Association failed with [akka.tcp://sparkExecutor@machine2:60949]]
> [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@machine2:60949]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: machine2/130.49.226.148:60949
> ]
>
> Port 48019 on machine2 is indeed open, connected, and listening. Any ideas?
>
> Thanks!
>
> Shannon
>
> On 6/27/14, 1:54 AM, sujeetv wrote:
>
>> Try to explicitly set the "spark.driver.host" property to the master's
>> IP.
>> Sujeet
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>

Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
For some reason, commenting out spark.driver.host and spark.driver.port 
fixed something...and broke something else (or at least revealed another 
problem). For reference, the only lines I have in my spark-defaults.conf 
now:

spark.app.name          myProg
spark.master            spark://192.168.1.101:5060
spark.executor.memory   8g
spark.files.overwrite   true

It starts up, but has problems with machine2. For some reason, machine2 
is having trouble communicating with *itself*. Here are the worker logs 
of one of the failures (there are 10 before it quits):

Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
14/06/27 14:55:13 INFO ExecutorRunner: Launch command: "java" "-cp" 
"::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" 
"-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" 
"akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler" "7" 
"machine2" "8" "akka.tcp://sparkWorker@machine2:48019/user/Worker" 
"app-20140627144512-0001"
14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7 
finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/27 14:56:54 INFO LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] 
from Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003] 
was not delivered. [10] dead letters encountered. This logging can be 
turned off or adjusted with configuration settings 
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@machine2:48019] -> 
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed 
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@machine2:60949]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: machine2/130.49.226.148:60949
]
14/06/27 14:56:54 INFO Worker: Asked to launch executor 
app-20140627144512-0001/8 for Funtown, USA
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@machine2:48019] -> 
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed 
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@machine2:60949]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: machine2/130.49.226.148:60949
]
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@machine2:48019] -> 
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed 
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@machine2:60949]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: machine2/130.49.226.148:60949
]

Port 48019 on machine2 is indeed open, connected, and listening. Any ideas?
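
(Worth adding: the refused address, 130.49.226.148, is what machine2 
appears to resolve its own hostname to in the log above, so a minimal 
sanity check on machine2 -- standard tools only -- is to compare name 
resolution against the interfaces actually configured:)

# on machine2
getent hosts machine2   # what "machine2" resolves to (the 130.49.226.148 above)
hostname -I             # addresses actually bound to machine2's interfaces
ss -ltn | grep 60949    # anything listening on the failing port? (likely
                        # nothing once the executor has already exited)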

Thanks!

Shannon

On 6/27/14, 1:54 AM, sujeetv wrote:
> Try to explicitly set the "spark.driver.host" property to the master's
> IP.
> Sujeet
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark standalone network configuration problems

Posted by sujeetv <sv...@gmail.com>.
Try to explicitly set the "spark.driver.host" property to the master's
IP. 
Sujeet
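
(Concretely, that would be a line like the following in 
conf/spark-defaults.conf on the machine running spark-submit; 
192.168.1.101 is the master IP used earlier in this thread, so 
substitute your own:)

spark.driver.host   192.168.1.101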



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
In the interest of completeness, this is how I invoke spark:

[on master]

> sbin/start-all.sh
> spark-submit --py-files extra.py main.py

iPhone'd

> On Jun 26, 2014, at 17:29, Shannon Quinn <sq...@gatech.edu> wrote:
> 
> My *best guess* (please correct me if I'm wrong) is that the master (machine1) is sending the command to the worker (machine2) with the localhost argument as-is; that is, machine2 isn't doing any weird address conversion on its end.
> 
> Consequently, I've been focusing on the settings of the master/machine1. But I haven't found anything to indicate where the localhost argument could be coming from. /etc/hosts lists only 127.0.0.1 as localhost; spark-defaults.conf lists spark.master as the full IP address (not 127.0.0.1); spark-env.sh on the master also lists the full IP under SPARK_MASTER_IP. The *only* place on the master where it's associated with localhost is SPARK_LOCAL_IP.
> 
> In looking at the logs of the worker spawned on master, it's also receiving a "spark://localhost:5060" argument, but since it resides on the master that works fine. Is it possible that the master is, for some reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
> 
> That was my motivation behind commenting out SPARK_LOCAL_IP;     however, that's when the master crashes immediately due to the address already being in use.
> 
> Any ideas? Thanks!
> 
> Shannon
> 
>> On 6/26/14, 10:14 AM, Akhil Das wrote:
>> Can you paste your spark-env.sh file?
>> 
>> Thanks
>> Best Regards
>> 
>> 
>>> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>> Both /etc/hosts have each other's IP addresses in them. Telneting from machine2 to machine1 on port 5060 works just fine.
>>> 
>>> Here's the output of lsof:
>>> 
>>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>>> COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>>> java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
>>> java    23985 user   40u  IPv6 11099560      0t0  TCP machine1:sip->machine1:48315 (ESTABLISHED)
>>> java    23985 user   52u  IPv6 11100405      0t0  TCP machine1:sip->machine2:54476 (ESTABLISHED)
>>> java    24157 user   40u  IPv6 11092413      0t0  TCP machine1:48315->machine1:sip (ESTABLISHED)
>>> 
>>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not actually running anything there besides Spark, it just does a s/5060/sip/g.
>>> 
>>> Is there something to the fact that every time I comment out SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due to the "address already being in use"? Or am I barking up the wrong tree on that one?
>>> 
>>> Thanks again for all your help; I hope we can knock this one out.
>>> 
>>> Shannon
>>> 
>>> 
>>>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>> Do you have <ip> machine1 in your workers /etc/hosts also? If so try telneting from your machine2 to machine1 on port 5060. Also make sure nothing else is running on port 5060 other than Spark (lsof -i:5060)
>>>> 
>>>> Thanks
>>>> Best Regards
>>>> 
>>>> 
>>>>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>>> Still running into the same problem. /etc/hosts on the master says
>>>>> 
>>>>> 127.0.0.1    localhost
>>>>> <ip>            machine1
>>>>> 
>>>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any other ideas?
>>>>> 
>>>>> 
>>>>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>> Hi Shannon,
>>>>>> 
>>>>>> It should be a configuration issue, check in your /etc/hosts and make sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>>>>> 
>>>>>> Thanks
>>>>>> Best Regards
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I have a 2-machine Spark network I've set up: a master and worker on machine1, and worker on machine2. When I run 'sbin/start-all.sh', everything starts up as it should. I see both workers listed on the UI page. The logs of both workers indicate successful registration with the Spark master.
>>>>>>> 
>>>>>>> The problems begin when I attempt to submit a job: I get an "address already in use" exception that crashes the program. It says "Failed to bind to " and lists the exact port and address of the master.
>>>>>>> 
>>>>>>> At this point, the only items I have set in my spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>>>> 
>>>>>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the master to 127.0.0.1. This allows the master to successfully send out the jobs; however, it ends up canceling the stage after running this command several times:
>>>>>>> 
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: app-20140625210032-0000/8 on worker-20140625205623-machine2-53597 (machine2:53597) with 8 cores
>>>>>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB RAM
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now RUNNING
>>>>>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>>>>> 
>>>>>>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at which point the program crashes. The worker on machine2 shows similar messages in its logs. Here are the last bunch:
>>>>>>> 
>>>>>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 finished with state FAILED message Command exited with code 1 exitStatus 1
>>>>>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor app-20140625210032-0000/10 for app_name
>>>>>>> Spark assembly has been built with Hive, including Datanucleus jars on classpath
>>>>>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp" "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler" "10" "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
>>>>>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10 finished with state FAILED message Command exited with code 1 exitStatus 1
>>>>>>> 
>>>>>>> I highlighted the part that seemed strange to me; that's the master port number (I set it to 5060), and yet it's referencing localhost? Is this the reason why machine2 apparently can't seem to give a confirmation to the master once the job is submitted? (The logs from the worker on the master node indicate that it's running just fine)
>>>>>>> 
>>>>>>> I appreciate any assistance you can offer!
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Shannon Quinn
> 

Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
My *best guess* (please correct me if I'm wrong) is that the master 
(machine1) is sending the command to the worker (machine2) with the 
localhost argument as-is; that is, machine2 isn't doing any weird 
address conversion on its end.

Consequently, I've been focusing on the settings of the master/machine1. 
But I haven't found anything to indicate where the localhost argument 
could be coming from. /etc/hosts lists only 127.0.0.1 as localhost; 
spark-defaults.conf lists spark.master as the full IP address (not 
127.0.0.1); spark-env.sh on the master also lists the full IP under 
SPARK_MASTER_IP. The *only* place on the master where it's associated 
with localhost is SPARK_LOCAL_IP.

In looking at the logs of the worker spawned on master, it's also 
receiving a "spark://localhost:5060" argument, but since it resides on 
the master that works fine. Is it possible that the master is, for some 
reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
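
One way to check which URL the master actually advertises (a sketch, 
assuming the default logs/ directory under the Spark install):

grep -R "spark://" spark-1.0.0-bin-hadoop2/logs/
# the Master log should record the spark://host:port it bound to, and the
# executor launch commands show the URL that was handed to each worker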

That was my motivation behind commenting out SPARK_LOCAL_IP; however, 
that's when the master crashes immediately due to the address already 
being in use.

Any ideas? Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:
> Can you paste your spark-env.sh file?
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     Both /etc/hosts have each other's IP addresses in them. Telneting
>     from machine2 to machine1 on port 5060 works just fine.
>
>     Here's the output of lsof:
>
>     user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>     COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>     java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip
>     (LISTEN)
>     java    23985 user   40u  IPv6 11099560      0t0  TCP
>     machine1:sip->machine1:48315 (ESTABLISHED)
>     java    23985 user   52u  IPv6 11100405      0t0  TCP
>     machine1:sip->machine2:54476 (ESTABLISHED)
>     java    24157 user   40u  IPv6 11092413      0t0  TCP
>     machine1:48315->machine1:sip (ESTABLISHED)
>
>     Ubuntu seems to recognize 5060 as the standard port for "sip";
>     it's not actually running anything there besides Spark, it just
>     does a s/5060/sip/g.
>
>     Is there something to the fact that every time I comment out
>     SPARK_LOCAL_IP in spark-env, it crashes immediately upon
>     spark-submit due to the "address already being in use"? Or am I
>     barking up the wrong tree on that one?
>
>     Thanks again for all your help; I hope we can knock this one out.
>
>     Shannon
>
>
>     On 6/26/14, 9:13 AM, Akhil Das wrote:
>>     Do you have <ip>         machine1 in your workers /etc/hosts
>>     also? If so try telneting from your machine2 to machine1 on port
>>     5060. Also make sure nothing else is running on port 5060 other
>>     than Spark (*/lsof -i:5060/*)
>>
>>     Thanks
>>     Best Regards
>>
>>
>>     On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu
>>     <ma...@gatech.edu>> wrote:
>>
>>         Still running into the same problem. /etc/hosts on the master
>>         says
>>
>>         127.0.0.1    localhost
>>         <ip>            machine1
>>
>>         <ip> is the same address set in spark-env.sh for
>>         SPARK_MASTER_IP. Any other ideas?
>>
>>
>>         On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>         Hi Shannon,
>>>
>>>         It should be a configuration issue, check in your /etc/hosts
>>>         and make sure localhost is not associated with the
>>>         SPARK_MASTER_IP you provided.
>>>
>>>         Thanks
>>>         Best Regards
>>>
>>>
>>>         On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>         <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>
>>>             Hi all,
>>>
>>>             I have a 2-machine Spark network I've set up: a master
>>>             and worker on machine1, and worker on machine2. When I
>>>             run 'sbin/start-all.sh', everything starts up as it
>>>             should. I see both workers listed on the UI page. The
>>>             logs of both workers indicate successful registration
>>>             with the Spark master.
>>>
>>>             The problems begin when I attempt to submit a job: I get
>>>             an "address already in use" exception that crashes the
>>>             program. It says "Failed to bind to " and lists the
>>>             exact port and address of the master.
>>>
>>>             At this point, the only items I have set in my
>>>             spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT
>>>             (non-standard, set to 5060).
>>>
>>>             The next step I took, then, was to explicitly set
>>>             SPARK_LOCAL_IP on the master to 127.0.0.1. This allows
>>>             the master to successfully send out the jobs; however,
>>>             it ends up canceling the stage after running this
>>>             command several times:
>>>
>>>             14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>>             added: app-20140625210032-0000/8 on
>>>             worker-20140625205623-machine2-53597 (machine2:53597)
>>>             with 8 cores
>>>             14/06/25 21:00:47 INFO SparkDeploySchedulerBackend:
>>>             Granted executor ID app-20140625210032-0000/8 on
>>>             hostPort machine2:53597 with 8 cores, 8.0 GB RAM
>>>             14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>>             updated: app-20140625210032-0000/8 is now RUNNING
>>>             14/06/25 21:00:49 INFO AppClient$ClientActor: Executor
>>>             updated: app-20140625210032-0000/8 is now FAILED
>>>             (Command exited with code 1)
>>>
>>>             The "/8" started at "/1", eventually becomes "/9", and
>>>             then "/10", at which point the program crashes. The
>>>             worker on machine2 shows similar messages in its logs.
>>>             Here are the last bunch:
>>>
>>>             14/06/25 21:00:31 INFO Worker: Executor
>>>             app-20140625210032-0000/9 finished with state FAILED
>>>             message Command exited with code 1 exitStatus 1
>>>             14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>             app-20140625210032-0000/10 for app_name
>>>             Spark assembly has been built with Hive, including
>>>             Datanucleus jars on classpath
>>>             14/06/25 21:00:32 INFO ExecutorRunner: Launch command:
>>>             "java" "-cp"
>>>             "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>             "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>             "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>             "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>             "10" "machine2" "8"
>>>             "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>             "app-20140625210032-0000"
>>>             14/06/25 21:00:33 INFO Worker: Executor
>>>             app-20140625210032-0000/10 finished with state FAILED
>>>             message Command exited with code 1 exitStatus 1
>>>
>>>             I highlighted the part that seemed strange to me; that's
>>>             the master port number (I set it to 5060), and yet it's
>>>             referencing localhost? Is this the reason why machine2
>>>             apparently can't seem to give a confirmation to the
>>>             master once the job is submitted? (The logs from the
>>>             worker on the master node indicate that it's running
>>>             just fine)
>>>
>>>             I appreciate any assistance you can offer!
>>>
>>>             Regards,
>>>             Shannon Quinn
>>>
>>>
>>
>>
>
>


Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
export SPARK_MASTER_IP="192.168.1.101"
export SPARK_MASTER_PORT="5060"
export SPARK_LOCAL_IP="127.0.0.1"

That's it. If I comment out the SPARK_LOCAL_IP or set it to be the same 
as SPARK_MASTER_IP, that's when it throws the "address already in use" 
error. If I leave it as the localhost IP, that's when I get the 
communication errors with machine2 that ultimately lead to the job failure.
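
Since "address already in use" literally means something is still bound 
to the port, the next thing I plan to try is a full stop, a check that 
5060 is actually free, and then binding to the LAN address rather than 
loopback. Roughly (assuming 192.168.1.101 is the interface machine2 
reaches; not verified yet):

sbin/stop-all.sh
lsof -i :5060                             # should print nothing once everything is down
# then in spark-env.sh:
export SPARK_MASTER_IP="192.168.1.101"
export SPARK_MASTER_PORT="5060"
export SPARK_LOCAL_IP="192.168.1.101"     # bind to the LAN address instead of 127.0.0.1
sbin/start-all.sh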

Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:
> Can you paste your spark-env.sh file?
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     Both /etc/hosts have each other's IP addresses in them. Telneting
>     from machine2 to machine1 on port 5060 works just fine.
>
>     Here's the output of lsof:
>
>     user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>     COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
>     java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip
>     (LISTEN)
>     java    23985 user   40u  IPv6 11099560      0t0  TCP
>     machine1:sip->machine1:48315 (ESTABLISHED)
>     java    23985 user   52u  IPv6 11100405      0t0  TCP
>     machine1:sip->machine2:54476 (ESTABLISHED)
>     java    24157 user   40u  IPv6 11092413      0t0  TCP
>     machine1:48315->machine1:sip (ESTABLISHED)
>
>     Ubuntu seems to recognize 5060 as the standard port for "sip";
>     it's not actually running anything there besides Spark, it just
>     does a s/5060/sip/g.
>
>     Is there something to the fact that every time I comment out
>     SPARK_LOCAL_IP in spark-env, it crashes immediately upon
>     spark-submit due to the "address already being in use"? Or am I
>     barking up the wrong tree on that one?
>
>     Thanks again for all your help; I hope we can knock this one out.
>
>     Shannon
>
>
>     On 6/26/14, 9:13 AM, Akhil Das wrote:
>>     Do you have <ip>         machine1 in your workers /etc/hosts
>>     also? If so try telneting from your machine2 to machine1 on port
>>     5060. Also make sure nothing else is running on port 5060 other
>>     than Spark (*/lsof -i:5060/*)
>>
>>     Thanks
>>     Best Regards
>>
>>
>>     On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu
>>     <ma...@gatech.edu>> wrote:
>>
>>         Still running into the same problem. /etc/hosts on the master
>>         says
>>
>>         127.0.0.1    localhost
>>         <ip>            machine1
>>
>>         <ip> is the same address set in spark-env.sh for
>>         SPARK_MASTER_IP. Any other ideas?
>>
>>
>>         On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>         Hi Shannon,
>>>
>>>         It should be a configuration issue, check in your /etc/hosts
>>>         and make sure localhost is not associated with the
>>>         SPARK_MASTER_IP you provided.
>>>
>>>         Thanks
>>>         Best Regards
>>>
>>>
>>>         On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>         <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>
>>>             Hi all,
>>>
>>>             I have a 2-machine Spark network I've set up: a master
>>>             and worker on machine1, and worker on machine2. When I
>>>             run 'sbin/start-all.sh', everything starts up as it
>>>             should. I see both workers listed on the UI page. The
>>>             logs of both workers indicate successful registration
>>>             with the Spark master.
>>>
>>>             The problems begin when I attempt to submit a job: I get
>>>             an "address already in use" exception that crashes the
>>>             program. It says "Failed to bind to " and lists the
>>>             exact port and address of the master.
>>>
>>>             At this point, the only items I have set in my
>>>             spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT
>>>             (non-standard, set to 5060).
>>>
>>>             The next step I took, then, was to explicitly set
>>>             SPARK_LOCAL_IP on the master to 127.0.0.1. This allows
>>>             the master to successfully send out the jobs; however,
>>>             it ends up canceling the stage after running this
>>>             command several times:
>>>
>>>             14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>>             added: app-20140625210032-0000/8 on
>>>             worker-20140625205623-machine2-53597 (machine2:53597)
>>>             with 8 cores
>>>             14/06/25 21:00:47 INFO SparkDeploySchedulerBackend:
>>>             Granted executor ID app-20140625210032-0000/8 on
>>>             hostPort machine2:53597 with 8 cores, 8.0 GB RAM
>>>             14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>>             updated: app-20140625210032-0000/8 is now RUNNING
>>>             14/06/25 21:00:49 INFO AppClient$ClientActor: Executor
>>>             updated: app-20140625210032-0000/8 is now FAILED
>>>             (Command exited with code 1)
>>>
>>>             The "/8" started at "/1", eventually becomes "/9", and
>>>             then "/10", at which point the program crashes. The
>>>             worker on machine2 shows similar messages in its logs.
>>>             Here are the last bunch:
>>>
>>>             14/06/25 21:00:31 INFO Worker: Executor
>>>             app-20140625210032-0000/9 finished with state FAILED
>>>             message Command exited with code 1 exitStatus 1
>>>             14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>             app-20140625210032-0000/10 for app_name
>>>             Spark assembly has been built with Hive, including
>>>             Datanucleus jars on classpath
>>>             14/06/25 21:00:32 INFO ExecutorRunner: Launch command:
>>>             "java" "-cp"
>>>             "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>             "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>             "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>             "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>>             "10" "machine2" "8"
>>>             "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>             "app-20140625210032-0000"
>>>             14/06/25 21:00:33 INFO Worker: Executor
>>>             app-20140625210032-0000/10 finished with state FAILED
>>>             message Command exited with code 1 exitStatus 1
>>>
>>>             I highlighted the part that seemed strange to me; that's
>>>             the master port number (I set it to 5060), and yet it's
>>>             referencing localhost? Is this the reason why machine2
>>>             apparently can't seem to give a confirmation to the
>>>             master once the job is submitted? (The logs from the
>>>             worker on the master node indicate that it's running
>>>             just fine)
>>>
>>>             I appreciate any assistance you can offer!
>>>
>>>             Regards,
>>>             Shannon Quinn
>>>
>>>
>>
>>
>
>


Re: Spark standalone network configuration problems

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you paste your spark-env.sh file?

Thanks
Best Regards


On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <sq...@gatech.edu> wrote:

>  Both /etc/hosts have each other's IP addresses in them. Telneting from
> machine2 to machine1 on port 5060 works just fine.
>
> Here's the output of lsof:
>
> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
> COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
> java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
> java    23985 user   40u  IPv6 11099560      0t0  TCP
> machine1:sip->machine1:48315 (ESTABLISHED)
> java    23985 user   52u  IPv6 11100405      0t0  TCP
> machine1:sip->machine2:54476 (ESTABLISHED)
> java    24157 user   40u  IPv6 11092413      0t0  TCP
> machine1:48315->machine1:sip (ESTABLISHED)
>
> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
> actually running anything there besides Spark, it just does a s/5060/sip/g.
>
> Is there something to the fact that every time I comment out
> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due
> to the "address already being in use"? Or am I barking up the wrong tree on
> that one?
>
> Thanks again for all your help; I hope we can knock this one out.
>
> Shannon
>
>
> On 6/26/14, 9:13 AM, Akhil Das wrote:
>
>  Do you have <ip>            machine1 in your workers /etc/hosts also? If
> so try telneting from your machine2 to machine1 on port 5060. Also make
> sure nothing else is running on port 5060 other than Spark (*lsof -i:5060*
> )
>
>  Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>
>>  Still running into the same problem. /etc/hosts on the master says
>>
>> 127.0.0.1    localhost
>> <ip>            machine1
>>
>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>> other ideas?
>>
>>
>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>
>>  Hi Shannon,
>>
>>  It should be a configuration issue, check in your /etc/hosts and make
>> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>
>>  Thanks
>> Best Regards
>>
>>
>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>>
>>>  Hi all,
>>>
>>> I have a 2-machine Spark network I've set up: a master and worker on
>>> machine1, and worker on machine2. When I run 'sbin/start-all.sh',
>>> everything starts up as it should. I see both workers listed on the UI
>>> page. The logs of both workers indicate successful registration with the
>>> Spark master.
>>>
>>> The problems begin when I attempt to submit a job: I get an "address
>>> already in use" exception that crashes the program. It says "Failed to bind
>>> to " and lists the exact port and address of the master.
>>>
>>> At this point, the only items I have set in my spark-env.sh are
>>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>
>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
>>> master to 127.0.0.1. This allows the master to successfully send out the
>>> jobs; however, it ends up canceling the stage after running this command
>>> several times:
>>>
>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>>> (machine2:53597) with 8 cores
>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
>>> app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB
>>> RAM
>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>>> app-20140625210032-0000/8 is now RUNNING
>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>
>>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
>>> which point the program crashes. The worker on machine2 shows similar
>>> messages in its logs. Here are the last bunch:
>>>
>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>> app-20140625210032-0000/10 for app_name
>>> Spark assembly has been built with Hive, including Datanucleus jars on
>>> classpath
>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "
>>> *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
>>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>> "app-20140625210032-0000"
>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>>
>>> I highlighted the part that seemed strange to me; that's the master port
>>> number (I set it to 5060), and yet it's referencing localhost? Is this the
>>> reason why machine2 apparently can't seem to give a confirmation to the
>>> master once the job is submitted? (The logs from the worker on the master
>>> node indicate that it's running just fine)
>>>
>>> I appreciate any assistance you can offer!
>>>
>>> Regards,
>>> Shannon Quinn
>>>
>>>
>>
>>
>
>

Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
Both /etc/hosts have each other's IP addresses in them. Telneting from 
machine2 to machine1 on port 5060 works just fine.

Here's the output of lsof:

user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
java    23985 user   40u  IPv6 11099560      0t0  TCP 
machine1:sip->machine1:48315 (ESTABLISHED)
java    23985 user   52u  IPv6 11100405      0t0  TCP 
machine1:sip->machine2:54476 (ESTABLISHED)
java    24157 user   40u  IPv6 11092413      0t0  TCP 
machine1:48315->machine1:sip (ESTABLISHED)

Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not 
actually running anything there besides Spark, it just does a s/5060/sip/g.
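
That substitution is just lsof consulting /etc/services; asking it for 
numeric output keeps the listing unambiguous:

lsof -nP -i :5060    # -P prints 5060 instead of "sip"; -n skips reverse DNS lookups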

Is there something to the fact that every time I comment out 
SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit 
due to the "address already being in use"? Or am I barking up the wrong 
tree on that one?

Thanks again for all your help; I hope we can knock this one out.

Shannon

On 6/26/14, 9:13 AM, Akhil Das wrote:
> Do you have <ip>         machine1 in your workers /etc/hosts also? If 
> so try telneting from your machine2 to machine1 on port 5060. Also 
> make sure nothing else is running on port 5060 other than Spark 
> (*/lsof -i:5060/*)
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     Still running into the same problem. /etc/hosts on the master says
>
>     127.0.0.1    localhost
>     <ip>            machine1
>
>     <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP.
>     Any other ideas?
>
>
>     On 6/26/14, 3:11 AM, Akhil Das wrote:
>>     Hi Shannon,
>>
>>     It should be a configuration issue, check in your /etc/hosts and
>>     make sure localhost is not associated with the SPARK_MASTER_IP
>>     you provided.
>>
>>     Thanks
>>     Best Regards
>>
>>
>>     On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squinn@gatech.edu
>>     <ma...@gatech.edu>> wrote:
>>
>>         Hi all,
>>
>>         I have a 2-machine Spark network I've set up: a master and
>>         worker on machine1, and worker on machine2. When I run
>>         'sbin/start-all.sh', everything starts up as it should. I see
>>         both workers listed on the UI page. The logs of both workers
>>         indicate successful registration with the Spark master.
>>
>>         The problems begin when I attempt to submit a job: I get an
>>         "address already in use" exception that crashes the program.
>>         It says "Failed to bind to " and lists the exact port and
>>         address of the master.
>>
>>         At this point, the only items I have set in my spark-env.sh
>>         are SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set
>>         to 5060).
>>
>>         The next step I took, then, was to explicitly set
>>         SPARK_LOCAL_IP on the master to 127.0.0.1. This allows the
>>         master to successfully send out the jobs; however, it ends up
>>         canceling the stage after running this command several times:
>>
>>         14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>         app-20140625210032-0000/8 on
>>         worker-20140625205623-machine2-53597 (machine2:53597) with 8
>>         cores
>>         14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted
>>         executor ID app-20140625210032-0000/8 on hostPort
>>         machine2:53597 with 8 cores, 8.0 GB RAM
>>         14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
>>         updated: app-20140625210032-0000/8 is now RUNNING
>>         14/06/25 21:00:49 INFO AppClient$ClientActor: Executor
>>         updated: app-20140625210032-0000/8 is now FAILED (Command
>>         exited with code 1)
>>
>>         The "/8" started at "/1", eventually becomes "/9", and then
>>         "/10", at which point the program crashes. The worker on
>>         machine2 shows similar messages in its logs. Here are the
>>         last bunch:
>>
>>         14/06/25 21:00:31 INFO Worker: Executor
>>         app-20140625210032-0000/9 finished with state FAILED message
>>         Command exited with code 1 exitStatus 1
>>         14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>         app-20140625210032-0000/10 for app_name
>>         Spark assembly has been built with Hive, including
>>         Datanucleus jars on classpath
>>         14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java"
>>         "-cp"
>>         "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>         "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>         "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>         "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>>         "10" "machine2" "8"
>>         "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>         "app-20140625210032-0000"
>>         14/06/25 21:00:33 INFO Worker: Executor
>>         app-20140625210032-0000/10 finished with state FAILED message
>>         Command exited with code 1 exitStatus 1
>>
>>         I highlighted the part that seemed strange to me; that's the
>>         master port number (I set it to 5060), and yet it's
>>         referencing localhost? Is this the reason why machine2
>>         apparently can't seem to give a confirmation to the master
>>         once the job is submitted? (The logs from the worker on the
>>         master node indicate that it's running just fine)
>>
>>         I appreciate any assistance you can offer!
>>
>>         Regards,
>>         Shannon Quinn
>>
>>
>
>


Re: Spark standalone network configuration problems

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Do you have <ip>            machine1 in your workers /etc/hosts also? If so
try telneting from your machine2 to machine1 on port 5060. Also make sure
nothing else is running on port 5060 other than Spark (*lsof -i:5060*)
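
For example, with the hostnames from your mail:

# from machine2
telnet machine1 5060

# on machine1
lsof -i :5060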

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:

>  Still running into the same problem. /etc/hosts on the master says
>
> 127.0.0.1    localhost
> <ip>            machine1
>
> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
> other ideas?
>
>
> On 6/26/14, 3:11 AM, Akhil Das wrote:
>
>  Hi Shannon,
>
>  It should be a configuration issue, check in your /etc/hosts and make
> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>
>  Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>
>>  Hi all,
>>
>> I have a 2-machine Spark network I've set up: a master and worker on
>> machine1, and worker on machine2. When I run 'sbin/start-all.sh',
>> everything starts up as it should. I see both workers listed on the UI
>> page. The logs of both workers indicate successful registration with the
>> Spark master.
>>
>> The problems begin when I attempt to submit a job: I get an "address
>> already in use" exception that crashes the program. It says "Failed to bind
>> to " and lists the exact port and address of the master.
>>
>> At this point, the only items I have set in my spark-env.sh are
>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>
>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
>> master to 127.0.0.1. This allows the master to successfully send out the
>> jobs; however, it ends up canceling the stage after running this command
>> several times:
>>
>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>> (machine2:53597) with 8 cores
>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
>> app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB
>> RAM
>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>> app-20140625210032-0000/8 is now RUNNING
>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>
>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
>> which point the program crashes. The worker on machine2 shows similar
>> messages in its logs. Here are the last bunch:
>>
>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>> finished with state FAILED message Command exited with code 1 exitStatus 1
>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>> app-20140625210032-0000/10 for app_name
>> Spark assembly has been built with Hive, including Datanucleus jars on
>> classpath
>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "
>> *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>> "app-20140625210032-0000"
>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>> finished with state FAILED message Command exited with code 1 exitStatus 1
>>
>> I highlighted the part that seemed strange to me; that's the master port
>> number (I set it to 5060), and yet it's referencing localhost? Is this the
>> reason why machine2 apparently can't seem to give a confirmation to the
>> master once the job is submitted? (The logs from the worker on the master
>> node indicate that it's running just fine)
>>
>> I appreciate any assistance you can offer!
>>
>> Regards,
>> Shannon Quinn
>>
>>
>
>

Re: Spark standalone network configuration problems

Posted by Shannon Quinn <sq...@gatech.edu>.
Still running into the same problem. /etc/hosts on the master says

127.0.0.1    localhost
<ip>            machine1

<ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any 
other ideas?
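
For double-checking how the name actually resolves on each box, the 
standard lookups are:

getent hosts machine1    # consults /etc/hosts via NSS
ping -c 1 machine1       # shows the address actually used for the name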

On 6/26/14, 3:11 AM, Akhil Das wrote:
> Hi Shannon,
>
> It should be a configuration issue, check in your /etc/hosts and make 
> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squinn@gatech.edu 
> <ma...@gatech.edu>> wrote:
>
>     Hi all,
>
>     I have a 2-machine Spark network I've set up: a master and worker
>     on machine1, and worker on machine2. When I run
>     'sbin/start-all.sh', everything starts up as it should. I see both
>     workers listed on the UI page. The logs of both workers indicate
>     successful registration with the Spark master.
>
>     The problems begin when I attempt to submit a job: I get an
>     "address already in use" exception that crashes the program. It
>     says "Failed to bind to " and lists the exact port and address of
>     the master.
>
>     At this point, the only items I have set in my spark-env.sh are
>     SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>
>     The next step I took, then, was to explicitly set SPARK_LOCAL_IP
>     on the master to 127.0.0.1. This allows the master to successfully
>     send out the jobs; however, it ends up canceling the stage after
>     running this command several times:
>
>     14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>     app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>     (machine2:53597) with 8 cores
>     14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted
>     executor ID app-20140625210032-0000/8 on hostPort machine2:53597
>     with 8 cores, 8.0 GB RAM
>     14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>     app-20140625210032-0000/8 is now RUNNING
>     14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>     app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>
>     The "/8" started at "/1", eventually becomes "/9", and then "/10",
>     at which point the program crashes. The worker on machine2 shows
>     similar messages in its logs. Here are the last bunch:
>
>     14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>     finished with state FAILED message Command exited with code 1
>     exitStatus 1
>     14/06/25 21:00:31 INFO Worker: Asked to launch executor
>     app-20140625210032-0000/10 for app_name
>     Spark assembly has been built with Hive, including Datanucleus
>     jars on classpath
>     14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java"
>     "-cp"
>     "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>     "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>     "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>     "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*"
>     "10" "machine2" "8"
>     "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>     "app-20140625210032-0000"
>     14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>     finished with state FAILED message Command exited with code 1
>     exitStatus 1
>
>     I highlighted the part that seemed strange to me; that's the
>     master port number (I set it to 5060), and yet it's referencing
>     localhost? Is this the reason why machine2 apparently can't seem
>     to give a confirmation to the master once the job is submitted?
>     (The logs from the worker on the master node indicate that it's
>     running just fine)
>
>     I appreciate any assistance you can offer!
>
>     Regards,
>     Shannon Quinn
>
>


Re: Spark standalone network configuration problems

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Shannon,

It should be a configuration issue, check in your /etc/hosts and make sure
localhost is not associated with the SPARK_MASTER_IP you provided.
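
The pattern to look out for is something like this (addresses are only 
illustrative):

# problematic: the master's hostname sharing the loopback line
127.0.0.1    localhost machine1

# expected: loopback and the real interface kept separate
127.0.0.1        localhost
192.168.1.101    machine1        # the same address set as SPARK_MASTER_IP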

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu> wrote:

>  Hi all,
>
> I have a 2-machine Spark network I've set up: a master and worker on
> machine1, and worker on machine2. When I run 'sbin/start-all.sh',
> everything starts up as it should. I see both workers listed on the UI
> page. The logs of both workers indicate successful registration with the
> Spark master.
>
> The problems begin when I attempt to submit a job: I get an "address
> already in use" exception that crashes the program. It says "Failed to bind
> to " and lists the exact port and address of the master.
>
> At this point, the only items I have set in my spark-env.sh are
> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>
> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
> master to 127.0.0.1. This allows the master to successfully send out the
> jobs; however, it ends up canceling the stage after running this command
> several times:
>
> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
> (machine2:53597) with 8 cores
> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB
> RAM
> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
> app-20140625210032-0000/8 is now RUNNING
> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>
> The "/8" started at "/1", eventually becomes "/9", and then "/10", at
> which point the program crashes. The worker on machine2 shows similar
> messages in its logs. Here are the last bunch:
>
> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
> app-20140625210032-0000/10 for app_name
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "
> *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
> "app-20140625210032-0000"
> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
> finished with state FAILED message Command exited with code 1 exitStatus 1
>
> I highlighted the part that seemed strange to me; that's the master port
> number (I set it to 5060), and yet it's referencing localhost? Is this the
> reason why machine2 apparently can't seem to give a confirmation to the
> master once the job is submitted? (The logs from the worker on the master
> node indicate that it's running just fine)
>
> I appreciate any assistance you can offer!
>
> Regards,
> Shannon Quinn
>
>