Posted to user@spark.apache.org by Shannon Quinn <sq...@gatech.edu> on 2014/06/26 03:07:34 UTC
Spark standalone network configuration problems
Hi all,
I have a 2-machine Spark network I've set up: a master and worker on
machine1, and worker on machine2. When I run 'sbin/start-all.sh',
everything starts up as it should. I see both workers listed on the UI
page. The logs of both workers indicate successful registration with the
Spark master.
The problems begin when I attempt to submit a job: I get an "address
already in use" exception that crashes the program. It says "Failed to
bind to " and lists the exact port and address of the master.
At this point, the only items I have set in my spark-env.sh are
SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
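For reference, a minimal spark-env.sh matching the setup described above might look like the following (a sketch; the IP address is the example one that appears later in this thread):

```shell
# spark-env.sh on the master: the only two settings in play at this point.
# SPARK_MASTER_PORT is non-standard here (5060 instead of the default 7077).
export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
```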
The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the
master to 127.0.0.1. This allows the master to successfully send out the
jobs; however, it ends up canceling the stage after running this command
several times:
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
(machine2:53597) with 8 cores
14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0
GB RAM
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
app-20140625210032-0000/8 is now RUNNING
14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
The "/8" started at "/1", eventually becomes "/9", and then "/10", at
which point the program crashes. The worker on machine2 shows similar
messages in its logs. Here are the last bunch:
14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/25 21:00:31 INFO Worker: Asked to launch executor
app-20140625210032-0000/10 for app_name
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
"::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
"-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10"
"machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
"app-20140625210032-0000"
14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
finished with state FAILED message Command exited with code 1 exitStatus 1
I highlighted the part that seemed strange to me; that's the master port
number (I set it to 5060), and yet it's referencing localhost? Is this
the reason why machine2 apparently can't seem to give a confirmation to
the master once the job is submitted? (The logs from the worker on the
master node indicate that it's running just fine)
I appreciate any assistance you can offer!
Regards,
Shannon Quinn
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
Sorry, master spark URL in the web UI is *spark://192.168.1.101:5060*,
exactly as configured.
On 6/27/14, 9:07 AM, Shannon Quinn wrote:
> I put the settings as you specified in spark-env.sh for the master.
> When I run start-all.sh, the web UI shows both the worker on the
> master (machine1) and the slave worker (machine2) as ALIVE and ready,
> with the master URL at spark://192.168.1.101. However, when I run
> spark-submit, it immediately crashes with
>
> py4j.protocol.Py4JJavaError14/06/27 09:01:32 ERROR Remoting: Remoting
> error: [Startup failed]
> akka.remote.RemoteTransportException: Startup failed
> [...]
> org.jboss.netty.channel.ChannelException: Failed to bind to
> /192.168.1.101:5060
> [...]
> java.net.BindException: Address already in use.
> [...]
>
> This seems entirely contrary to intuition; why would Spark be unable
> to bind to the exact IP:port set for the master?
>
> On 6/27/14, 1:54 AM, Akhil Das wrote:
>> Hi Shannon,
>>
>> How about a setting like the following? (just removed the quotes)
>>
>> export SPARK_MASTER_IP=192.168.1.101
>> export SPARK_MASTER_PORT=5060
>> #export SPARK_LOCAL_IP=127.0.0.1
>>
>> Not sure what's happening in your case; it could be that your system
>> is not able to bind to the 192.168.1.101 address. What is the spark://
>> master URL that you are seeing there in the web UI? (It should be
>> spark://192.168.1.101:5060 in your case.)
>>
>>
>>
>> Thanks
>> Best Regards
>>
>>
>> On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu
>> <ma...@gatech.edu>> wrote:
>>
>> In the interest of completeness, this is how I invoke spark:
>>
>> [on master]
>>
>> > sbin/start-all.sh
>> > spark-submit --py-files extra.py main.py
>>
>> iPhone'd
>>
>> On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu
>> <ma...@gatech.edu>> wrote:
>>
>>> My *best guess* (please correct me if I'm wrong) is that the
>>> master (machine1) is sending the command to the worker
>>> (machine2) with the localhost argument as-is; that is, machine2
>>> isn't doing any weird address conversion on its end.
>>>
>>> Consequently, I've been focusing on the settings of the
>>> master/machine1. But I haven't found anything to indicate where
>>> the localhost argument could be coming from. /etc/hosts lists
>>> only 127.0.0.1 as localhost; spark-defaults.conf lists
>>> spark.master as the full IP address (not 127.0.0.1);
>>> spark-env.sh on the master also lists the full IP under
>>> SPARK_MASTER_IP. The *only* place on the master where it's
>>> associated with localhost is SPARK_LOCAL_IP.
>>>
>>> In looking at the logs of the worker spawned on master, it's
>>> also receiving a "spark://localhost:5060" argument, but since it
>>> resides on the master that works fine. Is it possible that the
>>> master is, for some reason, passing
>>> "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>>>
>>> That was my motivation behind commenting out SPARK_LOCAL_IP;
>>> however, that's when the master crashes immediately due to the
>>> address already being in use.
>>>
>>> Any ideas? Thanks!
>>>
>>> Shannon
>>>
>>> On 6/26/14, 10:14 AM, Akhil Das wrote:
>>>> Can you paste your spark-env.sh file?
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>>
>>>> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
>>>> <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>
>>>> Both /etc/hosts have each other's IP addresses in them.
>>>> Telneting from machine2 to machine1 on port 5060 works just
>>>> fine.
>>>>
>>>> Here's the output of lsof:
>>>>
>>>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$
>>>> lsof -i:5060
>>>> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
>>>> java 23985 user 30u IPv6 11092354 0t0 TCP
>>>> machine1:sip (LISTEN)
>>>> java 23985 user 40u IPv6 11099560 0t0 TCP
>>>> machine1:sip->machine1:48315 (ESTABLISHED)
>>>> java 23985 user 52u IPv6 11100405 0t0 TCP
>>>> machine1:sip->machine2:54476 (ESTABLISHED)
>>>> java 24157 user 40u IPv6 11092413 0t0 TCP
>>>> machine1:48315->machine1:sip (ESTABLISHED)
>>>>
>>>> Ubuntu seems to recognize 5060 as the standard port for
>>>> "sip"; it's not actually running anything there besides
>>>> Spark, it just does a s/5060/sip/g.
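The "sip" substitution comes from the system services database, not from anything actually running on the port. A quick sketch to confirm (`getservbyport` consults the same `/etc/services` mapping that `lsof` uses; running `lsof -nP` instead would keep the output numeric and avoid the substitution entirely):

```python
import socket

# lsof shows "sip" because /etc/services maps TCP port 5060 to that
# service name; getservbyport performs the same lookup.
print(socket.getservbyport(5060, "tcp"))
```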
>>>>
>>>> Is there something to the fact that every time I comment
>>>> out SPARK_LOCAL_IP in spark-env, it crashes immediately
>>>> upon spark-submit due to the "address already being in
>>>> use"? Or am I barking up the wrong tree on that one?
>>>>
>>>> Thanks again for all your help; I hope we can knock this
>>>> one out.
>>>>
>>>> Shannon
>>>>
>>>>
>>>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>>> Do you have <ip> machine1 in your workers
>>>>> /etc/hosts also? If so try telneting from your machine2 to
>>>>> machine1 on port 5060. Also make sure nothing else is
>>>>> running on port 5060 other than Spark (*/lsof -i:5060/*)
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>>
>>>>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
>>>>> <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>>
>>>>> Still running into the same problem. /etc/hosts on the
>>>>> master says
>>>>>
>>>>> 127.0.0.1 localhost
>>>>> <ip> machine1
>>>>>
>>>>> <ip> is the same address set in spark-env.sh for
>>>>> SPARK_MASTER_IP. Any other ideas?
>>>>>
>>>>>
>>>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>> Hi Shannon,
>>>>>>
>>>>>> It should be a configuration issue, check in your
>>>>>> /etc/hosts and make sure localhost is not associated
>>>>>> with the SPARK_MASTER_IP you provided.
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>>>> <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
I put the settings as you specified in spark-env.sh for the master. When
I run start-all.sh, the web UI shows both the worker on the master
(machine1) and the slave worker (machine2) as ALIVE and ready, with the
master URL at spark://192.168.1.101. However, when I run spark-submit,
it immediately crashes with
py4j.protocol.Py4JJavaError14/06/27 09:01:32 ERROR Remoting: Remoting
error: [Startup failed]
akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to
/192.168.1.101:5060
[...]
java.net.BindException: Address already in use.
[...]
This seems entirely contrary to intuition; why would Spark be unable to
bind to the exact IP:port set for the master?
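At the OS level this is the same EADDRINUSE that any second bind to an occupied port produces; nothing Spark-specific is required to reproduce the failure mode. A minimal sketch with plain sockets:

```python
import socket

# First socket takes a port (the OS picks a free one). The second bind to
# the same address/port then fails with "Address already in use", which is
# exactly what surfaces as java.net.BindException in the driver logs.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
port = first.getsockname()[1]
first.listen(1)

err = None
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
except OSError as exc:
    err = exc
    print("bind failed:", exc)
finally:
    second.close()
    first.close()
```

So the question becomes what was already holding 192.168.1.101:5060 when the driver tried to bind it, e.g. the master process itself.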
On 6/27/14, 1:54 AM, Akhil Das wrote:
> Hi Shannon,
>
> How about a setting like the following? (just removed the quotes)
>
> export SPARK_MASTER_IP=192.168.1.101
> export SPARK_MASTER_PORT=5060
> #export SPARK_LOCAL_IP=127.0.0.1
>
> Not sure what's happening in your case; it could be that your system is
> not able to bind to the 192.168.1.101 address. What is the spark:// master
> URL that you are seeing there in the web UI? (It should be
> spark://192.168.1.101:5060 in your case.)
>
>
>
> Thanks
> Best Regards
>
>
> On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squinn@gatech.edu
> <ma...@gatech.edu>> wrote:
>
> In the interest of completeness, this is how I invoke spark:
>
> [on master]
>
> > sbin/start-all.sh
> > spark-submit --py-files extra.py main.py
>
> iPhone'd
>
> On Jun 26, 2014, at 17:29, Shannon Quinn <squinn@gatech.edu
> <ma...@gatech.edu>> wrote:
>
>> My *best guess* (please correct me if I'm wrong) is that the
>> master (machine1) is sending the command to the worker (machine2)
>> with the localhost argument as-is; that is, machine2 isn't doing
>> any weird address conversion on its end.
>>
>> Consequently, I've been focusing on the settings of the
>> master/machine1. But I haven't found anything to indicate where
>> the localhost argument could be coming from. /etc/hosts lists
>> only 127.0.0.1 as localhost; spark-defaults.conf lists
>> spark.master as the full IP address (not 127.0.0.1); spark-env.sh
>> on the master also lists the full IP under SPARK_MASTER_IP. The
>> *only* place on the master where it's associated with localhost
>> is SPARK_LOCAL_IP.
>>
>> In looking at the logs of the worker spawned on master, it's also
>> receiving a "spark://localhost:5060" argument, but since it
>> resides on the master that works fine. Is it possible that the
>> master is, for some reason, passing
>> "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>>
>> That was my motivation behind commenting out SPARK_LOCAL_IP;
>> however, that's when the master crashes immediately due to the
>> address already being in use.
>>
>> Any ideas? Thanks!
>>
>> Shannon
>>
>> On 6/26/14, 10:14 AM, Akhil Das wrote:
>>> Can you paste your spark-env.sh file?
>>>
>>> Thanks
>>> Best Regards
>>>
>>>
>>> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
>>> <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>
>>> Both /etc/hosts have each other's IP addresses in them.
>>> Telneting from machine2 to machine1 on port 5060 works just
>>> fine.
>>>
>>> Here's the output of lsof:
>>>
>>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$
>>> lsof -i:5060
>>> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
>>> java 23985 user 30u IPv6 11092354 0t0 TCP
>>> machine1:sip (LISTEN)
>>> java 23985 user 40u IPv6 11099560 0t0 TCP
>>> machine1:sip->machine1:48315 (ESTABLISHED)
>>> java 23985 user 52u IPv6 11100405 0t0 TCP
>>> machine1:sip->machine2:54476 (ESTABLISHED)
>>> java 24157 user 40u IPv6 11092413 0t0 TCP
>>> machine1:48315->machine1:sip (ESTABLISHED)
>>>
>>> Ubuntu seems to recognize 5060 as the standard port for
>>> "sip"; it's not actually running anything there besides
>>> Spark, it just does a s/5060/sip/g.
>>>
>>> Is there something to the fact that every time I comment out
>>> SPARK_LOCAL_IP in spark-env, it crashes immediately upon
>>> spark-submit due to the "address already being in use"? Or
>>> am I barking up the wrong tree on that one?
>>>
>>> Thanks again for all your help; I hope we can knock this one
>>> out.
>>>
>>> Shannon
>>>
>>>
>>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>> Do you have <ip> machine1 in your workers
>>>> /etc/hosts also? If so try telneting from your machine2 to
>>>> machine1 on port 5060. Also make sure nothing else is
>>>> running on port 5060 other than Spark (*/lsof -i:5060/*)
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>>
>>>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
>>>> <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>
>>>> Still running into the same problem. /etc/hosts on the
>>>> master says
>>>>
>>>> 127.0.0.1 localhost
>>>> <ip> machine1
>>>>
>>>> <ip> is the same address set in spark-env.sh for
>>>> SPARK_MASTER_IP. Any other ideas?
>>>>
>>>>
>>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>> Hi Shannon,
>>>>>
>>>>> It should be a configuration issue, check in your
>>>>> /etc/hosts and make sure localhost is not associated
>>>>> with the SPARK_MASTER_IP you provided.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>>
>>>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
>>>>> <squinn@gatech.edu <ma...@gatech.edu>> wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
Re: Spark standalone network configuration problems
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Shannon,
How about a setting like the following? (just removed the quotes)
export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1
Not sure what's happening in your case; it could be that your system is not
able to bind to the 192.168.1.101 address. What is the spark:// master URL
that you are seeing there in the web UI? (It should be
spark://192.168.1.101:5060 in your case.)
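Akhil's hypothesis (that the machine cannot bind that address) is easy to test directly, without involving Spark at all. A sketch; pass the real master IP in place of the loopback example:

```python
import socket

def can_bind(addr: str, port: int = 0) -> bool:
    """Return True if a TCP socket can bind to addr (port 0 = any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((addr, port))
        return True
    except OSError:
        # EADDRNOTAVAIL if the IP isn't assigned to any local interface,
        # EADDRINUSE if something already holds the port.
        return False
    finally:
        sock.close()

print(can_bind("127.0.0.1"))  # True: loopback is always bindable
# The actual check for this setup (run on the master):
# print(can_bind("192.168.1.101", 5060))
```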
Thanks
Best Regards
On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <sq...@gatech.edu> wrote:
> In the interest of completeness, this is how I invoke spark:
>
> [on master]
>
> > sbin/start-all.sh
> > spark-submit --py-files extra.py main.py
>
> iPhone'd
>
> On Jun 26, 2014, at 17:29, Shannon Quinn <sq...@gatech.edu> wrote:
>
> My *best guess* (please correct me if I'm wrong) is that the master
> (machine1) is sending the command to the worker (machine2) with the
> localhost argument as-is; that is, machine2 isn't doing any weird address
> conversion on its end.
>
> Consequently, I've been focusing on the settings of the master/machine1.
> But I haven't found anything to indicate where the localhost argument could
> be coming from. /etc/hosts lists only 127.0.0.1 as localhost;
> spark-defaults.conf lists spark.master as the full IP address (not
> 127.0.0.1); spark-env.sh on the master also lists the full IP under
> SPARK_MASTER_IP. The *only* place on the master where it's associated with
> localhost is SPARK_LOCAL_IP.
>
> In looking at the logs of the worker spawned on master, it's also
> receiving a "spark://localhost:5060" argument, but since it resides on the
> master that works fine. Is it possible that the master is, for some reason,
> passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>
> That was my motivation behind commenting out SPARK_LOCAL_IP; however,
> that's when the master crashes immediately due to the address already being
> in use.
>
> Any ideas? Thanks!
>
> Shannon
>
> On 6/26/14, 10:14 AM, Akhil Das wrote:
>
> Can you paste your spark-env.sh file?
>
> Thanks
> Best Regards
>
>
> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>
>> Both /etc/hosts have each other's IP addresses in them. Telneting from
>> machine2 to machine1 on port 5060 works just fine.
>>
>> Here's the output of lsof:
>>
>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
>> java 23985 user 30u IPv6 11092354 0t0 TCP machine1:sip
>> (LISTEN)
>> java 23985 user 40u IPv6 11099560 0t0 TCP
>> machine1:sip->machine1:48315 (ESTABLISHED)
>> java 23985 user 52u IPv6 11100405 0t0 TCP
>> machine1:sip->machine2:54476 (ESTABLISHED)
>> java 24157 user 40u IPv6 11092413 0t0 TCP
>> machine1:48315->machine1:sip (ESTABLISHED)
>>
>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
>> actually running anything there besides Spark, it just does a s/5060/sip/g.
>>
>> Is there something to the fact that every time I comment out
>> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due
>> to the "address already being in use"? Or am I barking up the wrong tree on
>> that one?
>>
>> Thanks again for all your help; I hope we can knock this one out.
>>
>> Shannon
>>
>>
>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>
>> Do you have <ip> machine1 in your workers /etc/hosts also?
>> If so try telneting from your machine2 to machine1 on port 5060. Also make
>> sure nothing else is running on port 5060 other than Spark (*lsof
>> -i:5060*)
>>
>> Thanks
>> Best Regards
>>
>>
>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>
>>> Still running into the same problem. /etc/hosts on the master says
>>>
>>> 127.0.0.1 localhost
>>> <ip> machine1
>>>
>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>>> other ideas?
>>>
>>>
>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>
>>> Hi Shannon,
>>>
>>> It should be a configuration issue, check in your /etc/hosts and make
>>> sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>>
>>> Thanks
>>> Best Regards
>>>
>>>
>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu>
>>> wrote:
>>>
>>>> [...]
>>>>
>>>>
>>>
>>>
>>
>>
>
>
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
No joy, unfortunately. Same issue; see my previous email--still crashes
with "address already in use."
On 6/27/14, 1:54 AM, sujeetv wrote:
> Try to explicitly set the "spark.driver.host" property to the master's
> IP.
> Sujeet
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
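Concretely, the suggestion corresponds to a line like the following in conf/spark-defaults.conf on the machine running the driver (the IP is the example master address from earlier in the thread; adjust as needed):

```properties
spark.driver.host  192.168.1.101
```

This pins the address the driver advertises to executors, instead of letting it be derived from hostname resolution.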
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
Apologies; can you advise as to how I would check that? I can certainly
SSH from master to machine2.
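One way to check, beyond SSH, is to attempt a TCP connection to the exact host and port from the error (a sketch; the 130.49.226.148:60949 pair below is the one from the executor log and is only illustrative, since executor ports are ephemeral):

```python
import socket

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-check against a local listener, so the sketch is runnable anywhere:
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
print(reachable("127.0.0.1", srv.getsockname()[1]))  # True
srv.close()

# The check that matters here, run from the master:
# print(reachable("130.49.226.148", 60949))
```

Note that SSH working only proves port 22 is open; an executor's ephemeral port can still be blocked by a firewall.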
On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
> Looks like your driver is not able to connect to the remote executor
> on machine2/130.49.226.148:60949. Can you check if the master machine
> can route to 130.49.226.148?
>
> Sujeet
>
>
> On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn <squinn@gatech.edu
> <ma...@gatech.edu>> wrote:
>
> For some reason, commenting out spark.driver.host and
> spark.driver.port fixed something...and broke something else (or
> at least revealed another problem). For reference, the only lines
> I have in my spark-defaults.conf now:
>
> spark.app.name myProg
> spark.master spark://192.168.1.101:5060
> spark.executor.memory 8g
> spark.files.overwrite true
>
> It starts up, but has problems with machine2. For some reason,
> machine2 is having trouble communicating with *itself*. Here are
> the worker logs of one of the failures (there are 10 before it
> quits):
>
>
> Spark assembly has been built with Hive, including Datanucleus
> jars on classpath
> 14/06/27 14:55:13 INFO ExecutorRunner: Launch command: "java"
> "-cp"
> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler" "7"
> "machine2" "8" "akka.tcp://sparkWorker@machine2:48019/user/Worker"
> "app-20140627144512-0001"
> 14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7
> finished with state FAILED message Command exited with code 1
> exitStatus 1
> 14/06/27 14:56:54 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
> from Actor[akka://sparkWorker/deadLetters] to
> Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003]
> was not delivered. [10] dead letters encountered. This logging can
> be turned off or adjusted with configuration settings
> 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@machine2:48019] ->
> [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
> failed with [akka.tcp://sparkExecutor@machine2:60949]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@machine2:60949]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: machine2/130.49.226.148:60949
> <http://130.49.226.148:60949>
> ]
> 14/06/27 14:56:54 INFO Worker: Asked to launch executor
> app-20140627144512-0001/8 for Funtown, USA
> 14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@machine2:48019] ->
> [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
> failed with [akka.tcp://sparkExecutor@machine2:60949]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@machine2:60949]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: machine2/130.49.226.148:60949
> <http://130.49.226.148:60949>
> ]
> 14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@machine2:48019] ->
> [akka.tcp://sparkExecutor@machine2:60949]: Error [Association
> failed with [akka.tcp://sparkExecutor@machine2:60949]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@machine2:60949]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: machine2/130.49.226.148:60949
> <http://130.49.226.148:60949>
> ]
>
> Port 48019 on machine2 is indeed open, connected, and listening.
> Any ideas?
>
> Thanks!
>
> Shannon
>
> On 6/27/14, 1:54 AM, sujeetv wrote:
>
> Try to explicitly set set the "spark.driver.host" property to
> the master's
> IP.
> Sujeet
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
>
>
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
I switched which machine was the master and which was the dedicated
worker, and now it works just fine. I discovered machine2 is on my
department's DMZ; machine1 is not. I suspect the departmental firewall
was causing the problems; moving the master to machine2 appears to have
resolved them.
Thank you all very much for your help. I'm sure I'll have other
questions soon :)
Regards,
Shannon
On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
> Looks like your driver is not able to connect to the remote executor
> on machine2/130.49.226.148:60949. Can you check if the master machine
> can route to 130.49.226.148?
>
> Sujeet
Re: Spark standalone network configuration problems
Posted by Sujeet Varakhedi <sv...@gopivotal.com>.
Looks like your driver is not able to connect to the remote executor on
machine2/130.49.226.148:60949. Can you check if the master machine can
route to 130.49.226.148?
Sujeet
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
For some reason, commenting out spark.driver.host and spark.driver.port
fixed something...and broke something else (or at least revealed another
problem). For reference, the only lines I have in my spark-defaults.conf
now:
spark.app.name myProg
spark.master spark://192.168.1.101:5060
spark.executor.memory 8g
spark.files.overwrite true
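A quick pre-flight check over a config like the one above can catch the loopback-master failure mode discussed in this thread. This is only a sketch; check_master_url is a hypothetical helper, not anything shipped with Spark:

```python
from urllib.parse import urlparse

LOOPBACK = {"localhost", "127.0.0.1"}

def check_master_url(conf_text):
    """Warn about spark.master entries pointing at a loopback address,
    which remote workers cannot dial back to."""
    warnings = []
    for line in conf_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "spark.master":
            host = urlparse(parts[1].strip()).hostname
            if host in LOOPBACK:
                warnings.append("spark.master uses loopback host %r" % host)
    return warnings
```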
It starts up, but has problems with machine2. For some reason, machine2
is having trouble communicating with *itself*. Here are the worker logs
of one of the failures (there are 10 before it quits):
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
14/06/27 14:55:13 INFO ExecutorRunner: Launch command: "java" "-cp"
"::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
"-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler" "7"
"machine2" "8" "akka.tcp://sparkWorker@machine2:48019/user/Worker"
"app-20140627144512-0001"
14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7
finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/27 14:56:54 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
from Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003]
was not delivered. [10] dead letters encountered. This logging can be
turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] ->
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
]
14/06/27 14:56:54 INFO Worker: Asked to launch executor
app-20140627144512-0001/8 for Funtown, USA
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] ->
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
]
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] ->
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
]
Port 48019 on machine2 is indeed open, connected, and listening. Any ideas?
Thanks!
Shannon
On 6/27/14, 1:54 AM, sujeetv wrote:
> Try to explicitly set the "spark.driver.host" property to the master's
> IP.
> Sujeet
Re: Spark standalone network configuration problems
Posted by sujeetv <sv...@gmail.com>.
Try to explicitly set the "spark.driver.host" property to the master's
IP.
Sujeet
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
In the interest of completeness, this is how I invoke Spark:
[on master]
> sbin/start-all.sh
> spark-submit --py-files extra.py main.py
iPhone'd
> On Jun 26, 2014, at 17:29, Shannon Quinn <sq...@gatech.edu> wrote:
>
> My *best guess* (please correct me if I'm wrong) is that the master (machine1) is sending the command to the worker (machine2) with the localhost argument as-is; that is, machine2 isn't doing any weird address conversion on its end.
>
> Consequently, I've been focusing on the settings of the master/machine1. But I haven't found anything to indicate where the localhost argument could be coming from. /etc/hosts lists only 127.0.0.1 as localhost; spark-defaults.conf list spark.master as the full IP address (not 127.0.0.1); spark-env.sh on the master also lists the full IP under SPARK_MASTER_IP. The *only* place on the master where it's associated with localhost is SPARK_LOCAL_IP.
>
> In looking at the logs of the worker spawned on master, it's also receiving a "spark://localhost:5060" argument, but since it resides on the master that works fine. Is it possible that the master is, for some reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>
> That was my motivation behind commenting out SPARK_LOCAL_IP; however, that's when the master crashes immediately due to the address already being in use.
>
> Any ideas? Thanks!
>
> Shannon
>
>> On 6/26/14, 10:14 AM, Akhil Das wrote:
>> Can you paste your spark-env.sh file?
>>
>> Thanks
>> Best Regards
>>
>>
>>> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>> Both /etc/hosts have each other's IP addresses in them. Telneting from machine2 to machine1 on port 5060 works just fine.
>>>
>>> Here's the output of lsof:
>>>
>>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>>> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
>>> java 23985 user 30u IPv6 11092354 0t0 TCP machine1:sip (LISTEN)
>>> java 23985 user 40u IPv6 11099560 0t0 TCP machine1:sip->machine1:48315 (ESTABLISHED)
>>> java 23985 user 52u IPv6 11100405 0t0 TCP machine1:sip->machine2:54476 (ESTABLISHED)
>>> java 24157 user 40u IPv6 11092413 0t0 TCP machine1:48315->machine1:sip (ESTABLISHED)
>>>
>>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not actually running anything there besides Spark, it just does a s/5060/sip/g.
>>>
>>> Is there something to the fact that every time I comment out SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due to the "address already being in use"? Or am I barking up the wrong tree on that one?
>>>
>>> Thanks again for all your help; I hope we can knock this one out.
>>>
>>> Shannon
>>>
>>>
>>>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>> Do you have <ip> machine1 in your workers /etc/hosts also? If so try telneting from your machine2 to machine1 on port 5060. Also make sure nothing else is running on port 5060 other than Spark (lsof -i:5060)
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>>
>>>>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>>> Still running into the same problem. /etc/hosts on the master says
>>>>>
>>>>> 127.0.0.1 localhost
>>>>> <ip> machine1
>>>>>
>>>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any other ideas?
>>>>>
>>>>>
>>>>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>> Hi Shannon,
>>>>>>
>>>>>> It should be a configuration issue, check in your /etc/hosts and make sure localhost is not associated with the SPARK_MASTER_IP you provided.
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a 2-machine Spark network I've set up: a master and worker on machine1, and worker on machine2. When I run 'sbin/start-all.sh', everything starts up as it should. I see both workers listed on the UI page. The logs of both workers indicate successful registration with the Spark master.
>>>>>>>
>>>>>>> The problems begin when I attempt to submit a job: I get an "address already in use" exception that crashes the program. It says "Failed to bind to " and lists the exact port and address of the master.
>>>>>>>
>>>>>>> At this point, the only items I have set in my spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>>>>
>>>>>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the master to 127.0.0.1. This allows the master to successfully send out the jobs; however, it ends up canceling the stage after running this command several times:
>>>>>>>
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: app-20140625210032-0000/8 on worker-20140625205623-machine2-53597 (machine2:53597) with 8 cores
>>>>>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB RAM
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now RUNNING
>>>>>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>>>>>
>>>>>>> The "/8" started at "/1", eventually becomes "/9", and then "/10", at which point the program crashes. The worker on machine2 shows similar messages in its logs. Here are the last bunch:
>>>>>>>
>>>>>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 finished with state FAILED message Command exited with code 1 exitStatus 1
>>>>>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor app-20140625210032-0000/10 for app_name
>>>>>>> Spark assembly has been built with Hive, including Datanucleus jars on classpath
>>>>>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp" "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler" "10" "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
>>>>>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10 finished with state FAILED message Command exited with code 1 exitStatus 1
>>>>>>>
>>>>>>> I highlighted the part that seemed strange to me; that's the master port number (I set it to 5060), and yet it's referencing localhost? Is this the reason why machine2 apparently can't seem to give a confirmation to the master once the job is submitted? (The logs from the worker on the master node indicate that it's running just fine)
>>>>>>>
>>>>>>> I appreciate any assistance you can offer!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shannon Quinn
>
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
My *best guess* (please correct me if I'm wrong) is that the master
(machine1) is sending the command to the worker (machine2) with the
localhost argument as-is; that is, machine2 isn't doing any weird
address conversion on its end.
Consequently, I've been focusing on the settings of the master/machine1.
But I haven't found anything to indicate where the localhost argument
could be coming from. /etc/hosts lists only 127.0.0.1 as localhost;
spark-defaults.conf lists spark.master as the full IP address (not
127.0.0.1); spark-env.sh on the master also lists the full IP under
SPARK_MASTER_IP. The *only* place on the master where it's associated
with localhost is SPARK_LOCAL_IP.
In looking at the logs of the worker spawned on master, it's also
receiving a "spark://localhost:5060" argument, but since it resides on
the master that works fine. Is it possible that the master is, for some
reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
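If that guess is right, the mechanism would be roughly the following. This is an illustrative sketch of the suspected behavior, not Spark's actual code:

```python
def advertised_master_url(spark_local_ip, spark_master_ip, port):
    """Illustrative: a daemon that advertises its bind address hands
    workers whatever SPARK_LOCAL_IP resolves to -- even 127.0.0.1,
    which points at the wrong machine everywhere but the master."""
    host = spark_local_ip if spark_local_ip else spark_master_ip
    return "spark://%s:%d" % (host, port)

# With SPARK_LOCAL_IP set to the loopback, workers are told to dial it:
print(advertised_master_url("127.0.0.1", "192.168.1.101", 5060))
# With it unset, they would get the routable master address instead:
print(advertised_master_url(None, "192.168.1.101", 5060))
```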
That was my motivation behind commenting out SPARK_LOCAL_IP; however,
that's when the master crashes immediately due to the address already
being in use.
Any ideas? Thanks!
Shannon
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
export SPARK_MASTER_IP="192.168.1.101"
export SPARK_MASTER_PORT="5060"
export SPARK_LOCAL_IP="127.0.0.1"
That's it. If I comment out the SPARK_LOCAL_IP or set it to be the same
as SPARK_MASTER_IP, that's when it throws the "address already in use"
error. If I leave it as the localhost IP, that's when I get the
communication errors with machine2 that ultimately lead to the job failure.
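One way to tell whether the "address already in use" error is some other process on 5060 or Spark colliding with itself is a plain bind probe, run once while the master is up and again after stopping it. A sketch:

```python
import errno
import socket

def port_in_use(ip, port):
    """Try to bind ip:port; EADDRINUSE means another socket holds it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return False
    except OSError as e:
        return e.errno == errno.EADDRINUSE
    finally:
        s.close()

# e.g. port_in_use("192.168.1.101", 5060) with the master running
```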
Thanks!
Shannon
On 6/26/14, 10:14 AM, Akhil Das wrote:
> Can you paste your spark-env.sh file?
Re: Spark standalone network configuration problems
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you paste your spark-env.sh file?
Thanks
Best Regards
On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> Both /etc/hosts have each other's IP addresses in them. Telneting from
> machine2 to machine1 on port 5060 works just fine.
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
Both machines' /etc/hosts files have each other's IP addresses in them.
Telnetting from machine2 to machine1 on port 5060 works just fine.
Here's the output of lsof:
user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 23985 user 30u IPv6 11092354 0t0 TCP machine1:sip (LISTEN)
java 23985 user 40u IPv6 11099560 0t0 TCP
machine1:sip->machine1:48315 (ESTABLISHED)
java 23985 user 52u IPv6 11100405 0t0 TCP
machine1:sip->machine2:54476 (ESTABLISHED)
java 24157 user 40u IPv6 11092413 0t0 TCP
machine1:48315->machine1:sip (ESTABLISHED)
Ubuntu recognizes 5060 as the standard port for "sip"; nothing besides
Spark is actually running there, lsof just substitutes the service name
for the port number (s/5060/sip/g).
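[Editor's note: the "sip" label comes from the service-name database (/etc/services), not from anything actually listening. A quick way to confirm that mapping, assuming a standard services file:]

```python
import socket

# lsof prints "machine1:sip" because the service-name database maps
# tcp/5060 to "sip"; the label is cosmetic and does not imply a SIP
# server is running. Systems without the mapping raise OSError.
try:
    print(socket.getservbyport(5060, "tcp"))
except OSError:
    print("no service name registered for 5060/tcp")
```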
Is there something to the fact that every time I comment out
SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit
with the "address already in use" error? Or am I barking up the wrong
tree on that one?
Thanks again for all your help; I hope we can knock this one out.
Shannon
On 6/26/14, 9:13 AM, Akhil Das wrote:
> Do you have <ip> machine1 in your workers /etc/hosts also? If
> so try telneting from your machine2 to machine1 on port 5060. Also
> make sure nothing else is running on port 5060 other than Spark
> (*/lsof -i:5060/*)
Re: Spark standalone network configuration problems
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Do you have <ip> machine1 in your workers' /etc/hosts as well? If so,
try telnetting from machine2 to machine1 on port 5060. Also make sure
nothing other than Spark is running on port 5060 (*lsof -i:5060*)
Thanks
Best Regards
On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> Still running into the same problem. /etc/hosts on the master says
>
> 127.0.0.1 localhost
> <ip> machine1
>
> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
> other ideas?
Re: Spark standalone network configuration problems
Posted by Shannon Quinn <sq...@gatech.edu>.
Still running into the same problem. /etc/hosts on the master says
127.0.0.1 localhost
<ip> machine1
<ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
other ideas?
On 6/26/14, 3:11 AM, Akhil Das wrote:
> Hi Shannon,
>
> It should be a configuration issue, check in your /etc/hosts and make
> sure localhost is not associated with the SPARK_MASTER_IP you provided.
Re: Spark standalone network configuration problems
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Shannon,
It sounds like a configuration issue: check your /etc/hosts and make sure
localhost is not associated with the SPARK_MASTER_IP you provided.
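[Editor's note: that check can be scripted. This is a hedged sketch; "localhost" below is used only so the snippet runs anywhere, and stands in for whatever hostname you point SPARK_MASTER_IP at.]

```python
import ipaddress
import socket

# Return True if a hostname resolves to a loopback address -- the
# situation warned about above for the master's hostname. Substitute
# the hostname you use as SPARK_MASTER_IP for "localhost".
def resolves_to_loopback(hostname: str) -> bool:
    return ipaddress.ip_address(socket.gethostbyname(hostname)).is_loopback

print(resolves_to_loopback("localhost"))  # True for localhost itself
```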
Thanks
Best Regards
On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <sq...@gatech.edu> wrote: