Posted to user@mesos.apache.org by Andrew Jones <an...@andrew-jones.com> on 2014/07/03 16:00:46 UTC

Hadoop on Mesos instantly terminates after registering

Hi,

I'm having a bit of trouble running Hadoop on Mesos.

I've followed the instructions at https://github.com/mesos/hadoop, using
the same CDH5 package and Mesos 0.19. The only difference I can think of
is that I am using an existing HDFS cluster rather than a local one.

When starting the job tracker, it all looks fine:

14/07/03 13:51:57 INFO mapred.MesosScheduler: Starting MesosScheduler
Warning: MESOS_NATIVE_LIBRARY is deprecated, use
MESOS_NATIVE_JAVA_LIBRARY instead. Future releases will not support JNI
bindings via MESOS_NATIVE_LIBRARY.
I0703 13:51:57.096916 62004 sched.cpp:126] Version: 0.19.0
I0703 13:51:57.100489 62055 sched.cpp:222] New master detected at
master@10.6.27.72:5050
I0703 13:51:57.100621 62055 sched.cpp:230] No credentials provided.
Attempting to register without authentication
14/07/03 13:51:57 INFO mapred.JobTracker: Starting the recovery process
for 0 jobs ...
14/07/03 13:51:57 INFO mapred.JobTracker: Recovery done! Recoverd 0 of 0
jobs.
14/07/03 13:51:57 INFO mapred.JobTracker: Recovery Duration (ms):0
14/07/03 13:51:57 INFO mapred.JobTracker: Refreshing hosts information
14/07/03 13:51:57 INFO util.HostsFileReader: Setting the includes file
to 
14/07/03 13:51:57 INFO util.HostsFileReader: Setting the excludes file
to 
14/07/03 13:51:57 INFO util.HostsFileReader: Refreshing hosts
(include/exclude) list
14/07/03 13:51:57 INFO mapred.JobTracker: Decommissioning 0 nodes
14/07/03 13:51:57 INFO ipc.Server: IPC Server listener on 9001: starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server Responder: starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 0 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 1 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 2 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 3 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 4 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 5 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 6 on 9001:
starting
14/07/03 13:51:57 INFO mapred.JobTracker: Starting RUNNING
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 7 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 8 on 9001:
starting
14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 9 on 9001:
starting

But after connecting to Mesos and registering as a framework, it seems
to terminate instantly and then retry. See the screenshot of the
terminated frameworks table at https://db.tt/COLpSUIQ.

Nothing more is posted to the job tracker log, and I can't see much in
the Mesos log on the master. Here are the messages for one framework:

I0703 13:57:26.040679 51675 master.cpp:1059] Registering framework
20140620-174222-1209730570-5050-51658-0666 at
scheduler(1)@127.0.1.1:53662
I0703 13:57:26.040802 51675 hierarchical_allocator_process.hpp:331]
Added framework 20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.040817 51663 master.cpp:662] Framework
20140620-174222-1209730570-5050-51658-0666 disconnected
I0703 13:57:26.041102 51663 master.cpp:1319] Deactivating framework
20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041127 51663 master.cpp:684] Giving framework
20140620-174222-1209730570-5050-51658-0666 0ns to failover
W0703 13:57:26.041158 51663 master.cpp:2862] Master returning resources
offered to framework 20140620-174222-1209730570-5050-51658-0666 because
the framework has terminated or is inactive
I0703 13:57:26.041177 51664 hierarchical_allocator_process.hpp:407]
Deactivated framework 20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041251 51666 master.cpp:2849] Framework failover timeout,
removing framework 20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041373 51666 master.cpp:3344] Removing framework
20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041261 51664 hierarchical_allocator_process.hpp:636]
Recovered cpus(*):16; mem(*):192391; disk(*):1.51388e+06;
ports(*):[31000-32000] (total allocatable: cpus(*):16; mem(*):192391;
disk(*):1.51388e+06; ports(*):[31000-32000]) on slave
20140618-174325-1209730570-5050-4637-1 from framework
20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041502 51664 hierarchical_allocator_process.hpp:636]
Recovered cpus(*):16; mem(*):192391; disk(*):1.51388e+06;
ports(*):[31000-32000] (total allocatable: cpus(*):16; mem(*):192391;
disk(*):1.51388e+06; ports(*):[31000-32000]) on slave
20140618-174325-1209730570-5050-4637-0 from framework
20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041566 51664 hierarchical_allocator_process.hpp:636]
Recovered cpus(*):15; mem(*):191879; disk(*):1.51388e+06;
ports(*):[31000-31643, 31645-32000] (total allocatable: cpus(*):15;
mem(*):191879; disk(*):1.51388e+06; ports(*):[31000-31643, 31645-32000])
on slave 20140618-172514-1209730570-5050-1282-0 from framework
20140620-174222-1209730570-5050-51658-0666
I0703 13:57:26.041592 51664 hierarchical_allocator_process.hpp:362]
Removed framework 20140620-174222-1209730570-5050-51658-0666

Setting HADOOP_ROOT_LOGGER=DEBUG,console before running the job tracker
didn't seem to give me any more Mesos-related messages.
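For reference, the debug attempt amounted to something like the
following; the jobtracker start command shown is illustrative, since the
exact command depends on how the CDH5 package is installed:

```shell
# Sketch only: enable Hadoop's console debug logging before starting
# the job tracker. "hadoop jobtracker" stands in for whatever start
# command your CDH5 install provides.
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop jobtracker
```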

Has anyone else come across this problem? Any ideas what might be
causing it?

Is there any way I can increase the logging for Mesos or the Mesos
Hadoop library?

This is Mesos 0.19 on Ubuntu 14.04. The Mesos cluster is running other
frameworks (Marathon, Chronos) without problems. I'm new to both Mesos
and Hadoop.

Thanks,
Andrew

Re: Hadoop on Mesos instantly terminates after registering

Posted by Vinod Kone <vi...@gmail.com>.
On Thu, Jul 3, 2014 at 7:00 AM, Andrew Jones <an...@andrew-jones.com>
wrote:

> I0703 13:57:26.040679 51675 master.cpp:1059] Registering framework
> 20140620-174222-1209730570-5050-51658-0666 at
> scheduler(1)@127.0.1.1:53662
>

The Hadoop scheduler is registering with the master, but at a local IP
address (127.0.1.1). Setting LIBPROCESS_IP in the environment of the
Hadoop scheduler to a publicly accessible IP should fix this issue.
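A minimal sketch of that fix, before starting the job tracker. The
address 10.6.27.73 is a placeholder for the scheduler host's routable
IP, and the library path is illustrative:

```shell
# Placeholder address: use the IP of the machine running the Hadoop
# scheduler, as reachable from the Mesos master.
export LIBPROCESS_IP=10.6.27.73
# While here, use the non-deprecated variable from the startup warning;
# adjust the path to wherever libmesos is installed.
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
hadoop jobtracker
```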