Posted to user@accumulo.apache.org by Arshak Navruzyan <ar...@gmail.com> on 2014/01/01 00:02:16 UTC

slave tserver not responding

I configured a new instance with a master and a slave tserver.  When I do
start-all on the master, the slave doesn't come up.  I am wondering if it's
because I left the instance secret as the default. (I get an exception when
I try to change that).

This is what I see in the master's monitor regarding the slave

Non-Functioning Tablet Servers
The following tablet servers reported a status other than Online

10.240.203.36:9997 UNRESPONSIVE

In the master log I see the following

2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server
status 10.240.203.36:9997[1434a79d34404a2]
org.apache.thrift.transport.TTransportException:
java.net.NoRouteToHostException: No route to host
2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server
status 10.240.203.36:9997[1434a79d34404a2]
org.apache.thrift.transport.TTransportException:
java.net.NoRouteToHostException: No route to host
2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class
org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state
for store Root Tablet
org.apache.thrift.transport.TTransportException:
java.net.NoRouteToHostException: No route to host
        at
org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
        at
org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
        at
org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
        at
org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)



In the slave's tserver.log all I see is

2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet
server lock (reason = LOCK_DELETED), exiting.
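
A note on reading the NoRouteToHostException above: it helps to tell apart
"no route to host" (a routing problem, or possibly a host firewall rejecting
the connection) from "connection refused" (host reachable, nothing listening
on the port). Below is a minimal sketch of such a probe, not anything that
ships with Accumulo. The function names are made up for illustration, and
the idea that a firewall is involved here is only an assumption.

```python
import errno
import socket


def classify_errno(code):
    """Map a connect() errno value to a short diagnosis label."""
    if code == 0:
        return "open"
    if code == errno.ECONNREFUSED:
        # Host is reachable but nothing is listening on the port.
        return "refused"
    if code in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        # Same condition the master log reports as "No route to host".
        return "no route to host"
    if code in (errno.ETIMEDOUT, errno.EAGAIN, errno.EWOULDBLOCK):
        return "timed out"
    return "other (errno %d)" % code


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and return a diagnosis label."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        code = sock.connect_ex((host, port))
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        # e.g. name resolution failures raise instead of returning a code
        code = exc.errno or -1
    finally:
        sock.close()
    return classify_errno(code)
```

Run from the master, probe("10.240.203.36", 9997) would report which of
these cases applies to the tserver port at that moment.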

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Here is the output of bin/start-here.sh on the slave node

bin/start-here.sh
Starting tablet server on 10.240.203.36
WARN : Max files open on 10.240.203.36 is 1024, recommend 65536
Starting garbage collector on localhost
WARN : Max files open on localhost is 1024, recommend 65536
Starting tracer on localhost
WARN : Max files open on localhost is 1024, recommend 65536
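
The "Max files open" warnings above are probably unrelated to the tserver
dying, but they are easy to check. A small sketch, assuming a Unix host and
using the 65536 figure from the warning itself; the helper names are
hypothetical:

```python
import resource

RECOMMENDED_NOFILE = 65536  # value quoted in the start-here.sh warning


def nofile_ok(soft_limit, recommended=RECOMMENDED_NOFILE):
    """True when the soft open-files limit meets the recommendation."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited always satisfies the recommendation
    return soft_limit >= recommended


def current_soft_nofile():
    """Soft open-files limit of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Calling nofile_ok(current_soft_nofile()) as the user that runs Accumulo
shows whether the limit the scripts complain about is in effect.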


On Tue, Dec 31, 2013 at 7:19 PM, Sean Busbey <bu...@clouderagovt.com> wrote:

> Can you paste the output of running bin/start-here.sh on the worker node?

Re: slave tserver not responding

Posted by Sean Busbey <bu...@clouderagovt.com>.
Can you paste the output of running bin/start-here.sh on the worker node?
On Dec 31, 2013 9:16 PM, "Arshak Navruzyan" <ar...@gmail.com> wrote:

> This is what I see when I try to start it again when the master is
> running.  I am sure that I don't have any other version of Accumulo running
> anywhere or any discrepancy between the master and slave in terms of jars
> (installed from the same jar)
>
> 10.240.165.43 : monitor already running (16136)
> Starting tablet servers ..... done
> 10.240.165.43 : tablet server already running (16359)
> Starting tablet server on 10.240.203.36
> 2014-01-01 03:12:26,234 [server.Accumulo] INFO : Attempting to talk to
> zookeeper
> 2014-01-01 03:12:26,373 [server.Accumulo] INFO : Zookeeper connected and
> initialized, attemping to talk to HDFS
> 2014-01-01 03:12:26,377 [server.Accumulo] INFO : Connected to HDFS
> WARN : Max files open on 10.240.203.36 is 1024, recommend 65536
> 10.240.165.43 : master already running (16549)
> localhost : garbage collector already running (16634)
> localhost : tracer already running (16728)

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
This is what I see when I try to start it again while the master is running.
I am sure that I don't have any other version of Accumulo running anywhere
or any discrepancy between the master and slave in terms of jars (installed
from the same jar).

10.240.165.43 : monitor already running (16136)
Starting tablet servers ..... done
10.240.165.43 : tablet server already running (16359)
Starting tablet server on 10.240.203.36
2014-01-01 03:12:26,234 [server.Accumulo] INFO : Attempting to talk to
zookeeper
2014-01-01 03:12:26,373 [server.Accumulo] INFO : Zookeeper connected and
initialized, attemping to talk to HDFS
2014-01-01 03:12:26,377 [server.Accumulo] INFO : Connected to HDFS
WARN : Max files open on 10.240.203.36 is 1024, recommend 65536
10.240.165.43 : master already running (16549)
localhost : garbage collector already running (16634)
localhost : tracer already running (16728)



On Tue, Dec 31, 2013 at 6:35 PM, Christopher <ct...@apache.org> wrote:

> Have you tried running bin/start-all.sh a second time, while the
> master is still running? And, are you sure you don't have any other
> versions of Accumulo (perhaps of a different version) running?
>
> Also, check your lib directories to ensure you don't have any older
> versions of Accumulo jars in there.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii

Re: slave tserver not responding

Posted by Christopher <ct...@apache.org>.
Have you tried running bin/start-all.sh a second time, while the
master is still running? And are you sure you don't have any other
instances of Accumulo (perhaps of a different version) running?

Also, check your lib directories to ensure you don't have any older
versions of Accumulo jars in there.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Tue, Dec 31, 2013 at 7:31 PM, Arshak Navruzyan <ar...@gmail.com> wrote:
> I can connect fine from the zkCli of the slave using
>
> bin/zkCli.sh -server 10.240.165.43:2181
>
> and do
>
> ls /accumulo

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
I can connect fine from the zkCli of the slave using

bin/zkCli.sh -server 10.240.165.43:2181

and do

ls /accumulo


On Tue, Dec 31, 2013 at 3:20 PM, Sean Busbey <bu...@clouderagovt.com> wrote:

>
> On Dec 31, 2013 5:17 PM, "Arshak Navruzyan" <ar...@gmail.com> wrote:
> >
> > IPv4.  I am using ip address when I ssh.  masters and slaves files in
> > the conf directory also have ip addresses.
> >
> >
>
> How about trying to use the zookeeper-client from one of the failing
> tserver hosts?
>

Re: slave tserver not responding

Posted by Sean Busbey <bu...@clouderagovt.com>.
On Dec 31, 2013 5:17 PM, "Arshak Navruzyan" <ar...@gmail.com> wrote:
>
> IPv4.  I am using ip address when I ssh.  masters and slaves files in the
> conf directory also have ip addresses.
>
>

How about trying to use the zookeeper-client from one of the failing
tserver hosts?

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Hadoop 1.0.4
Subversion
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
1393290
Compiled by hortonfo on Wed Oct  3 05:17:59 UTC 2012


On Tue, Dec 31, 2013 at 3:22 PM, Sean Busbey <bu...@clouderagovt.com> wrote:

>
> Oh! What does "hadoop version" report?

Re: slave tserver not responding

Posted by Sean Busbey <bu...@clouderagovt.com>.
Oh! What does "hadoop version" report?
On Dec 31, 2013 5:17 PM, "Arshak Navruzyan" <ar...@gmail.com> wrote:

> IPv4.  I am using ip address when I ssh.  masters and slaves files in the
> conf directory also have ip addresses.

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
IPv4.  I am using ip address when I ssh.  masters and slaves files in the
conf directory also have ip addresses.


On Tue, Dec 31, 2013 at 3:13 PM, Sean Busbey <bu...@clouderagovt.com> wrote:

> when you ssh, are you using hostnames or the ip addresses?
>
> is IPv6 present?
>
> --
> Sean
>

Re: slave tserver not responding

Posted by Sean Busbey <bu...@clouderagovt.com>.
when you ssh, are you using hostnames or the ip addresses?

is IPv6 present?


On Tue, Dec 31, 2013 at 5:11 PM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Accumulo 1.5.  Nothing in the *.err or *.out files on either the master or
> the slave.
>
> Needless to say I can ssh from the master to the slave.
>
> Thanks!


-- 
Sean

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Accumulo 1.5.  Nothing in the *.err or *.out files on either the master or
the slave.

Needless to say I can ssh from the master to the slave.

Thanks!


On Tue, Dec 31, 2013 at 3:05 PM, Christopher <ct...@apache.org> wrote:

> What version?
> Also, check the contents of the *.err and *.out logs.
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii

Re: slave tserver not responding

Posted by Christopher <ct...@apache.org>.
What version?
Also, check the contents of the *.err and *.out logs.


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Tue, Dec 31, 2013 at 6:02 PM, Arshak Navruzyan <ar...@gmail.com> wrote:

> I configured a new instance with a master and a slave tserver.  When I do
> start-all on the master, the slave doesn't come up.  I am wondering if it's
> because I left the instance secret as the default. (I get an exception when
> I try to change that).
>
> This is what I see in the master's monitor regarding the slave
>
> Non-Functioning Tablet Servers
> The following tablet servers reported a status other than Online
>
>  10.240.203.36:9997 UNRESPONSIVE
> In the master log I see the following
>
> 2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server
> status 10.240.203.36:9997[1434a79d34404a2]
> org.apache.thrift.transport.TTransportException:
> java.net.NoRouteToHostException: No route to host
> 2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server
> status 10.240.203.36:9997[1434a79d34404a2]
> org.apache.thrift.transport.TTransportException:
> java.net.NoRouteToHostException: No route to host
> 2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class
> org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
> 2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
> 2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table
> state for store Root Tablet
> org.apache.thrift.transport.TTransportException:
> java.net.NoRouteToHostException: No route to host
>         at
> org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>         at
> org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>         at
> org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>         at
> org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>
>
>
> In the slave's tserver.log all I see is
>
> 2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet
> server lock (reason = LOCK_DELETED), exiting.
>
>
>

Re: slave tserver not responding

Posted by Josh Elser <jo...@gmail.com>.
The BAD_CREDENTIALS error is just the root password not matching the 
trace.token.property.password. By default, the configurations set the 
password for Accumulo's distributed trace mechanism to be "secret".

It's best to make a special user and password for tracing and configure 
it in accumulo-site.xml. An easy way to get rid of that error is to just 
set the aforementioned property equal to the root password (and chmod 
600 accumulo-site.xml) ;)
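
To double-check a setup like this, something along the following lines
could confirm that the property is actually set in accumulo-site.xml and
that the file is not readable by other users. This is only a sketch: it
assumes the usual Hadoop-style <configuration>/<property> layout of the
site file, and the function names are invented for the example.

```python
import os
import stat
import xml.etree.ElementTree as ET


def read_site_property(path, name):
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    root = ET.parse(path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value")
    return None


def is_owner_only(path):
    """True when group and others have no permissions at all (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

For instance, read_site_property("conf/accumulo-site.xml",
"trace.token.property.password") returns the configured value (or None),
and is_owner_only on the same path checks the chmod 600 part. The
conf/accumulo-site.xml path is a guess at the install layout.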

On 1/1/14, 3:46 PM, Michael Wall wrote:
> I don't know if it helps debugging, but I am seeing the following in
> tserver_shrine.log
>
> 2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Exception in
> createBlockOutputStream 10.240.165.43:50010
> java.io.IOException: Bad connect ack with firstBadLink as
> 10.240.203.36:50010
> 2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Abandoning block
> blk_-2756969025267118869_1348
> 2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode
> 10.240.203.36:50010
> 2014-01-01 06:15:38,147 [hdfs.DFSClient] INFO : Exception in
> createBlockOutputStream 10.240.165.43:50010
> java.io.IOException: Bad connect ack with firstBadLink as
> 10.240.203.36:50010
> 2014-01-01 06:15:38,148 [hdfs.DFSClient] INFO : Abandoning block
> blk_2883724569463729419_1349
> 2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode
> 10.240.203.36:50010
> 2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
> 2014-01-01 06:15:39,559 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
> 2014-01-01 06:15:40,565 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
> 2014-01-01 06:15:41,571 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
> 2014-01-01 06:15:42,578 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
> 2014-01-01 06:15:43,586 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
> 2014-01-01 06:15:44,594 [client.ClientServiceHandler] ERROR:
> ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
>
>
>
> On Wed, Jan 1, 2014 at 2:28 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>     Sure -- you have my address already.
>
>     Also, nc not working while the tabletserver is dead makes sense
>     (that process is what's listening on that port). Once the process
>     dies, there's nothing else listening.
>
>
>     On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
>
>         If anyone wants to look at my live environment please let me
>         know (your
>         gmail) and I will add you to the Google Compute Engine.  Thanks!
>
>
>         On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan
>         <arshakn@gmail.com <ma...@gmail.com>
>         <mailto:arshakn@gmail.com <ma...@gmail.com>>> wrote:
>
>              Sean
>
>              Thanks for looking into the log files.
>
>              These are two Google compute engine instance under the same
>         project
>              so there shouldn't be any firewall between them.
>
>              For the brief moment that the slave runs during startup, I
>         can nc
>              into port 9997 from the master to the slave.  But after it
>         crashes,
>              I can't.  Seems like somehow the problem is on the slave.
>
>              Arshak
>
>              On Dec 31, 2013 11:58 PM, "Sean Busbey"
>         <busbey+ml@clouderagovt.com <ma...@clouderagovt.com>
>              <mailto:busbey%2Bml@__clouderagovt.com
>         <ma...@clouderagovt.com>>> wrote:
>
>                  Well, I can tell you the proximal cause.  the tserver
>         log shows
>                  that it starts normally, then exits because it's told
>         to (via
>                  the zookeeper lock being removed).
>
>                  If you look at the master debug logs, this happens
>         because the
>                  master fails in three attempts to talk to the tserver,
>         all with
>                  the same error:
>
>                  2014-01-01 06:17:20,231 [master.Master] ERROR: unable
>         to get
>                  tablet server status 10.240.203.36:9997[__1434c70ed30001b]
>                  org.apache.thrift.transport.__TTransportException:
>         java.net <http://java.net>.__NoRouteToHostException: No route to
>         host
>
>                  Unfortunately, this is the same error you noticed in
>         your first
>                  email. After 3 of those, the master deletes the zk lock
>         so that
>                  the tserver will shutdown.
>
>                  Could there be another firewall blocking access to port
>         9997 on
>                  the worker machine from the master machine?
>
>                  Check from the master (you'll need netcat):
>
>                  $ nc -z 10.240.203.36 9997
>                  $ echo $?
>
>
>
>
>
>                  On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan
>                  <arshakn@gmail.com <ma...@gmail.com>
>         <mailto:arshakn@gmail.com <ma...@gmail.com>>> wrote:
>
>                      I am probably missing something really basic so I
>         posted
>                      both the master and the slave log files:
>
>         https://www.dropbox.com/sh/__liv1mzuohyiv6uu/X5kx7AZJ6i
>         <https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i>
>
>                      Thanks again to everyone for the help!
>
>
>                      On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan
>                      <arshakn@gmail.com <ma...@gmail.com>
>         <mailto:arshakn@gmail.com <ma...@gmail.com>>> wrote:
>
>                          disabled selinux (iptables already off) on both
>         master
>                          and slave but didn't make a difference
>         unfortunately.
>
>
>
>                          On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen
>                          <hoodel@hoodel.com <ma...@hoodel.com>
>         <mailto:hoodel@hoodel.com <ma...@hoodel.com>>> wrote:
>
>
>                              SELINUX disabled? IPTABLES configured? I have
>                              nothing else.
>
>                              Kurt
>
>                              ------
>
>
>                              On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>
>                                  I configured a new instance with a
>         master and a
>                                  slave tserver.  When I do start-all on the
>                                  master, the slave doesn't come up.  I am
>                                  wondering if it's because I left the
>         instance
>                                  secret as the default. (I get an
>         exception when
>                                  I try to change that).
>
>                                  This is what I see in the master's monitor
>                                  regarding the slave
>
>                                       Non-Functioning Tablet Servers
>                                       The following tablet servers
>         reported a
>                                  status other than Online
>
>         10.240.203.36:9997 <http://10.240.203.36:9997>
>         <http://10.240.203.36:9997>
>                                  <http://10.240.203.36:9997>  UNRESPONSIVE
>
>
>
>                                  In the master log I see the following
>
>                                       2013-12-31 22:56:13,665
>         [master.Master]
>                                  ERROR: unable to get
>                                       tablet server status
>                                  10.240.203.36:9997[____1434a79d34404a2]
>
>
>         org.apache.thrift.transport.____TTransportException:
>         java.net <http://java.net>
>
>         <http://java.net>.____NoRouteToHostException: No
>
>                                  route to host
>                                       2013-12-31 22:56:13,712
>         [master.Master]
>                                  ERROR: unable to get
>                                       tablet server status
>                                  10.240.203.36:9997[____1434a79d34404a2]
>
>
>         org.apache.thrift.transport.____TTransportException:
>         java.net <http://java.net>
>
>         <http://java.net>.____NoRouteToHostException: No
>
>                                  route to host
>                                       2013-12-31 22:56:13,802
>                                  [balancer.TableLoadBalancer] INFO : Loaded
>                                       class
>
>
>         org.apache.accumulo.server.____master.balancer.____DefaultLoadBalancer
>
>                                  for
>                                       table !0
>                                       2013-12-31 22:56:13,803
>         [master.Master]
>                                  INFO : Assigning 1 tablets
>                                       2013-12-31 22:56:13,812
>         [master.Master]
>                                  ERROR: Error processing
>                                       table state for store Root Tablet
>
>
>         org.apache.thrift.transport.____TTransportException:
>         java.net <http://java.net>
>
>         <http://java.net>.____NoRouteToHostException: No
>                                  route to host
>                                               at
>
>
>         org.apache.accumulo.core.____client.impl.____ThriftTransportPool.____createNewTransport(____ThriftTransportPool.java:475)
>                                               at
>
>
>         org.apache.accumulo.core.____client.impl.____ThriftTransportPool.____getTransport(____ThriftTransportPool.java:464)
>                                               at
>
>
>         org.apache.accumulo.core.____client.impl.____ThriftTransportPool.____getTransport(____ThriftTransportPool.java:441)
>                                               at
>
>
>         org.apache.accumulo.core.____client.impl.____ThriftTransportPool.____getTransportWithDefaultTimeout____(ThriftTransportPool.java:__366)
>
>
>
>
>                                  In the slave's tserver.log all I see is
>
>                                       2013-12-31 22:56:34,731
>                                  [tabletserver.TabletServer] FATAL: Lost
>                                       tablet server lock (reason =
>         LOCK_DELETED),
>                                  exiting.
>
>
>                              --
>
>                              Kurt Christensen
>                              P.O. Box 811
>                              Westminster, MD 21158-0811
>
>
>         ------------------------------____----------------------------__--__------------
>
>                              If you can't explain it simply, you don't
>         understand
>                              it well enough. -- Albert Einstein
>
>
>
>
>
>
>                  --
>                  Sean
>
>
>

Re: slave tserver not responding

Posted by Michael Wall <mj...@gmail.com>.
I don't know if it helps debugging, but I am seeing the following in
tserver_shrine.log

2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Exception in
createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad
connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Abandoning block
blk_-2756969025267118869_1348
2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,147 [hdfs.DFSClient] INFO : Exception in
createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad
connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:38,148 [hdfs.DFSClient] INFO : Abandoning block
blk_2883724569463729419_1349
2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:39,559 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:40,565 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:41,571 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:42,578 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:43,586 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:44,594 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)



On Wed, Jan 1, 2014 at 2:28 PM, Josh Elser <jo...@gmail.com> wrote:

> Sure -- you have my address already.
>
> Also, nc not working while the tabletserver is dead makes sense (that
> process is what's listening on that port). Once the process dies, there's
> nothing else listening.

Re: slave tserver not responding

Posted by Josh Elser <jo...@gmail.com>.
Ok -- turned out to be a couple of little things, with one big one :D

The big one -- iptables was still running on the slave :)

I noticed the same NoRouteToHostException in the datanode logs when it 
tried to replicate, so I assumed the problem was something outside of 
Hadoop. Running `telnet slave_ip_addr port` with the address and port 
shown in the stack trace confirmed that I could not connect. iptables had 
an exception for SSH, which is why SSH'ing still worked and Arshak could 
start the processes.
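
For anyone hitting the same symptom, here is a rough sketch of that kind of 
connectivity check. It is not necessarily the exact command that was run. It 
assumes bash (for its /dev/tcp redirection) and the coreutils `timeout` 
command; `check_port` is a made-up helper name, and the address and port in 
the comments are the slave and tserver port from the logs in this thread. 
On stock RHEL/CentOS rule sets the catch-all REJECT rule typically shows up 
on the client as "No route to host" rather than as a timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a TCP connection to host:port succeeds.
check_port() {
    host=$1
    port=$2
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
    # timeout caps the wait in case packets are silently dropped.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed or filtered"
    fi
}

# Run from the master against the slave's tserver port:
#   check_port 10.240.203.36 9997
# Then, on the slave (as root), list the active rules to see what is allowed:
#   iptables -L -n --line-numbers
```

If SSH (port 22) reports "open" while 9997 and 50010 do not, a host firewall 
with an SSH-only exception is a likely suspect.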

Small things:

ifconfig showed that IPv6 was still enabled on the interfaces. I disabled 
it via procfs and made the change permanent via sysctl. It would likely 
have caused more trouble later, but I noticed it before I found the 
iptables problem.
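
A sketch of what checking and disabling IPv6 can look like on a Linux host. 
The procfs paths and sysctl keys are the standard kernel ones, but whether 
these are the exact settings changed here is an assumption; `ipv6_state` is 
a hypothetical helper name.

```shell
# Hypothetical helper: report the kernel's current IPv6 setting.
ipv6_state() {
    flag=/proc/sys/net/ipv6/conf/all/disable_ipv6
    if [ ! -r "$flag" ]; then
        echo "unknown"            # no IPv6 support in the kernel, or not Linux
    elif [ "$(cat "$flag")" = "1" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

ipv6_state

# To turn IPv6 off until the next reboot (as root, via procfs):
#   echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
#   echo 1 > /proc/sys/net/ipv6/conf/default/disable_ipv6
# To make it permanent, add these lines to /etc/sysctl.conf and run `sysctl -p`:
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1
```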

The max open files limit was still at 1024, which was likely to cause you 
more problems. I raised it for the user you run Accumulo as.
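
A minimal sketch for checking the limit, run as the user that starts 
Accumulo. `check_nofile` is a hypothetical helper, and the user name 
`accumulo` and the value 65536 in the comments are placeholders, not the 
values actually set on this cluster.

```shell
# Hypothetical helper: compare the current open-files limit to a minimum.
check_nofile() {
    wanted=$1
    current=$(ulimit -n)          # soft limit for this shell and its children
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]; then
        echo "ok: $current"
    else
        echo "too low: $current (want at least $wanted)"
    fi
}

check_nofile 65536

# To raise the limit for one user, add lines like these to
# /etc/security/limits.conf (applied at that user's next login):
#   accumulo  soft  nofile  65536
#   accumulo  hard  nofile  65536
```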

- Josh

On 1/1/14, 2:28 PM, Josh Elser wrote:
> Sure -- you have my address already.
>
> Also, nc not working while the tabletserver is dead makes sense (that
> process is what's listening on that port). Once the process dies,
> there's nothing else listening.

Re: slave tserver not responding

Posted by Josh Elser <jo...@gmail.com>.
Sure -- you have my address already.

Also, nc not working while the tabletserver is dead makes sense (that 
process is what's listening on that port). Once the process dies, 
there's nothing else listening.
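
One way to confirm this on the slave itself is to ask the kernel whether 
anything is bound to the port. The sketch below assumes either `ss` or 
`netstat` is installed; `is_listening` is a hypothetical helper name and 
9997 is the tserver port from the logs in this thread.

```shell
# Hypothetical helper: report whether any local TCP socket listens on a port.
is_listening() {
    port=$1
    if command -v ss >/dev/null 2>&1; then
        listing=$(ss -ltn 2>/dev/null)        # -l listening, -t TCP, -n numeric
    elif command -v netstat >/dev/null 2>&1; then
        listing=$(netstat -ltn 2>/dev/null)
    else
        echo "unknown"
        return 0
    fi
    # The local address column ends in ":PORT" followed by whitespace.
    if printf '%s\n' "$listing" | grep -Eq "[:.]${port}[[:space:]]"; then
        echo "listening"
    else
        echo "not listening"
    fi
}

is_listening 9997
```

"not listening" right after the tserver exits is expected. A port that is 
listening locally but unreachable from the master points at the network or 
a firewall instead.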

On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
> If anyone wants to look at my live environment please let me know (your
> gmail) and I will add you to the Google Compute Engine.  Thanks!
>
>
> On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan <arshakn@gmail.com>
> wrote:
>
>     Sean
>
>     Thanks for looking into the log files.
>
>     These are two Google compute engine instance under the same project
>     so there shouldn't be any firewall between them.
>
>     For the brief moment that the slave runs during startup, I can nc
>     into port 9997 from the master to the slave.  But after it crashes,
>     I can't.  Seems like somehow the problem is on the slave.
>
>     Arshak
>
>     On Dec 31, 2013 11:58 PM, "Sean Busbey" <busbey+ml@clouderagovt.com>
>     wrote:
>
>         Well, I can tell you the proximal cause.  the tserver log shows
>         that it starts normally, then exits because it's told to (via
>         the zookeeper lock being removed).
>
>         If you look at the master debug logs, this happens because the
>         master fails in three attempts to talk to the tserver, all with
>         the same error:
>
>         2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get
>         tablet server status 10.240.203.36:9997[1434c70ed30001b]
>         org.apache.thrift.transport.TTransportException:
>         java.net.NoRouteToHostException: No route to host
>
>         Unfortunately, this is the same error you noticed in your first
>         email. After 3 of those, the master deletes the zk lock so that
>         the tserver will shutdown.
>
>         Could there be another firewall blocking access to port 9997 on
>         the worker machine from the master machine?
>
>         Check from the master (you'll need netcat):
>
>         $ nc -z 10.240.203.36 9997
>         $ echo $?
>
>
>
>
>
>         On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan
>         <arshakn@gmail.com> wrote:
>
>             I am probably missing something really basic so I posted
>             both the master and the slave log files:
>
>             https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>
>             Thanks again to everyone for the help!
>
>
>             On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan
>             <arshakn@gmail.com> wrote:
>
>                 disabled selinux (iptables already off) on both master
>                 and slave but didn't make a difference unfortunately.
>
>
>
>                 On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen
>                 <hoodel@hoodel.com> wrote:
>
>
>                     SELINUX disabled? IPTABLES configured? I have
>                     nothing else.
>
>                     Kurt
>
>                     ------
>
>
>                     On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>
>                         I configured a new instance with a master and a
>                         slave tserver.  When I do start-all on the
>                         master, the slave doesn't come up.  I am
>                         wondering if it's because I left the instance
>                         secret as the default. (I get an exception when
>                         I try to change that).
>
>                         This is what I see in the master's monitor
>                         regarding the slave
>
>                         Non-Functioning Tablet Servers
>                         The following tablet servers reported a status
>                         other than Online
>
>                         10.240.203.36:9997  UNRESPONSIVE
>
>                         In the master log I see the following
>
>                         2013-12-31 22:56:13,665 [master.Master] ERROR:
>                         unable to get tablet server status
>                         10.240.203.36:9997[1434a79d34404a2]
>                         org.apache.thrift.transport.TTransportException:
>                         java.net.NoRouteToHostException: No route to host
>                         2013-12-31 22:56:13,712 [master.Master] ERROR:
>                         unable to get tablet server status
>                         10.240.203.36:9997[1434a79d34404a2]
>                         org.apache.thrift.transport.TTransportException:
>                         java.net.NoRouteToHostException: No route to host
>                         2013-12-31 22:56:13,802 [balancer.TableLoadBalancer]
>                         INFO : Loaded class
>                         org.apache.accumulo.server.master.balancer.DefaultLoadBalancer
>                         for table !0
>                         2013-12-31 22:56:13,803 [master.Master] INFO :
>                         Assigning 1 tablets
>                         2013-12-31 22:56:13,812 [master.Master] ERROR:
>                         Error processing table state for store Root Tablet
>                         org.apache.thrift.transport.TTransportException:
>                         java.net.NoRouteToHostException: No route to host
>                             at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>                             at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>                             at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>                             at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>
>                         In the slave's tserver.log all I see is
>
>                         2013-12-31 22:56:34,731 [tabletserver.TabletServer]
>                         FATAL: Lost tablet server lock (reason =
>                         LOCK_DELETED), exiting.
>
>
>                     --
>
>                     Kurt Christensen
>                     P.O. Box 811
>                     Westminster, MD 21158-0811
>
>                     ------------------------------------------------------------------------
>                     If you can't explain it simply, you don't understand
>                     it well enough. -- Albert Einstein
>
>
>
>
>
>
>         --
>         Sean
>
>

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
If anyone wants to look at my live environment, please let me know (with
your Gmail address) and I will add you to the Google Compute Engine
project.  Thanks!


On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan <ar...@gmail.com> wrote:

> Sean
>
> Thanks for looking into the log files.
>
> These are two Google compute engine instance under the same project so
> there shouldn't be any firewall between them.
>
> For the brief moment that the slave runs during startup, I can nc into
> port 9997 from the master to the slave.  But after it crashes, I can't.
> Seems like somehow the problem is on the slave.
>
> Arshak
> On Dec 31, 2013 11:58 PM, "Sean Busbey" <bu...@clouderagovt.com>
> wrote:
>
>> Well, I can tell you the proximal cause.  the tserver log shows that it
>> starts normally, then exits because it's told to (via the zookeeper lock
>> being removed).
>>
>> If you look at the master debug logs, this happens because the master
>> fails in three attempts to talk to the tserver, all with the same error:
>>
>> 2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get tablet
>> server status 10.240.203.36:9997[1434c70ed30001b]
>> org.apache.thrift.transport.TTransportException:
>> java.net.NoRouteToHostException: No route to host
>>
>> Unfortunately, this is the same error you noticed in your first email.
>> After 3 of those, the master deletes the zk lock so that the tserver will
>> shutdown.
>>
>> Could there be another firewall blocking access to port 9997 on the
>> worker machine from the master machine?
>>
>> Check from the master (you'll need netcat):
>>
>> $ nc -z 10.240.203.36 9997
>> $ echo $?
>>
>>
>>
>>
>>
>> On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan <ar...@gmail.com>wrote:
>>
>>> I am probably missing something really basic so I posted both the master
>>> and the slave log files:
>>>
>>> https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>>>
>>> Thanks again to everyone for the help!
>>>
>>>
>>> On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan <ar...@gmail.com>wrote:
>>>
>>>> disabled selinux (iptables already off) on both master and slave but
>>>> didn't make a difference unfortunately.
>>>>
>>>>
>>>>
>>>> On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen <ho...@hoodel.com>wrote:
>>>>
>>>>>
>>>>> SELINUX disabled? IPTABLES configured? I have nothing else.
>>>>>
>>>>> Kurt
>>>>>
>>>>> ------
>>>>>
>>>>>
>>>>> On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>>>>
>>>>>> I configured a new instance with a master and a slave tserver.  When
>>>>>> I do start-all on the master, the slave doesn't come up.  I am wondering if
>>>>>> it's because I left the instance secret as the default. (I get an exception
>>>>>> when I try to change that).
>>>>>>
>>>>>> This is what I see in the master's monitor regarding the slave
>>>>>>
>>>>>>     Non-Functioning Tablet Servers
>>>>>>     The following tablet servers reported a status other than Online
>>>>>>
>>>>>> 10.240.203.36:9997 <http://10.240.203.36:9997>  UNRESPONSIVE
>>>>>>
>>>>>>
>>>>>>
>>>>>> In the master log I see the following
>>>>>>
>>>>>>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>>>>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>>>>     org.apache.thrift.transport.TTransportException:
>>>>>>     java.net.NoRouteToHostException: No route to host
>>>>>>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>>>>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>>>>     org.apache.thrift.transport.TTransportException:
>>>>>>     java.net.NoRouteToHostException: No route to host
>>>>>>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>>>>>     class
>>>>>>     org.apache.accumulo.server.master.balancer.DefaultLoadBalancer
>>>>>> for
>>>>>>     table !0
>>>>>>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>>>>>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>>>>>     table state for store Root Tablet
>>>>>>     org.apache.thrift.transport.TTransportException:
>>>>>>     java.net.NoRouteToHostException: No route to host
>>>>>>             at
>>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>>> createNewTransport(ThriftTransportPool.java:475)
>>>>>>             at
>>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>>> getTransport(ThriftTransportPool.java:464)
>>>>>>             at
>>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>>> getTransport(ThriftTransportPool.java:441)
>>>>>>             at
>>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>>> getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>>>>>
>>>>>>
>>>>>>
>>>>>> In the slave's tserver.log all I see is
>>>>>>
>>>>>>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>>>>>     tablet server lock (reason = LOCK_DELETED), exiting.
>>>>>>
>>>>>>
>>>>> --
>>>>>
>>>>> Kurt Christensen
>>>>> P.O. Box 811
>>>>> Westminster, MD 21158-0811
>>>>>
>>>>> ------------------------------------------------------------
>>>>> ------------
>>>>> If you can't explain it simply, you don't understand it well enough.
>>>>> -- Albert Einstein
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Sean
>>
>

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Sean

Thanks for looking into the log files.

These are two Google Compute Engine instances under the same project, so
there shouldn't be any firewall between them.

For the brief moment that the slave runs during startup, I can nc into port
9997 from the master to the slave.  But after it crashes, I can't.  Seems
like somehow the problem is on the slave.
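
A polling loop is one way to see exactly when port 9997 stops answering
after start-all.  This is only a sketch: watch_port and its arguments are
made-up names, not part of Accumulo, and it assumes a netcat that supports
the -z and -w options.

```sh
# Probe a TCP port a fixed number of times and print a timestamped state
# for each probe, so the moment the listener disappears is visible.
# watch_port is a hypothetical helper name.
watch_port() {
    # $1: host, $2: port, $3: number of probes, $4: seconds between probes
    i=0
    while [ "$i" -lt "$3" ]; do
        # -z: only test for a listener; -w 1: give up after one second
        if nc -z -w 1 "$1" "$2" >/dev/null 2>&1; then
            state=open
        else
            state=closed
        fi
        echo "$(date '+%H:%M:%S') $state"
        i=$((i + 1))
        if [ "$i" -lt "$3" ]; then
            sleep "$4"
        fi
    done
}

# Example, run on the master right after start-all:
#   watch_port 10.240.203.36 9997 60 1
```

"closed" here covers both a refused connection and an unreachable host, so
it does not tell the two apart; it also prints "closed" if nc itself is
missing.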

Arshak
On Dec 31, 2013 11:58 PM, "Sean Busbey" <bu...@clouderagovt.com> wrote:

> Well, I can tell you the proximal cause.  the tserver log shows that it
> starts normally, then exits because it's told to (via the zookeeper lock
> being removed).
>
> If you look at the master debug logs, this happens because the master
> fails in three attempts to talk to the tserver, all with the same error:
>
> 2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get tablet server
> status 10.240.203.36:9997[1434c70ed30001b]
> org.apache.thrift.transport.TTransportException:
> java.net.NoRouteToHostException: No route to host
>
> Unfortunately, this is the same error you noticed in your first email.
> After 3 of those, the master deletes the zk lock so that the tserver will
> shutdown.
>
> Could there be another firewall blocking access to port 9997 on the worker
> machine from the master machine?
>
> Check from the master (you'll need netcat):
>
> $ nc -z 10.240.203.36 9997
> $ echo $?
>
>
>
>
>
> On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan <ar...@gmail.com>wrote:
>
>> I am probably missing something really basic so I posted both the master
>> and the slave log files:
>>
>> https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>>
>> Thanks again to everyone for the help!
>>
>>
>> On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan <ar...@gmail.com>wrote:
>>
>>> disabled selinux (iptables already off) on both master and slave but
>>> didn't make a difference unfortunately.
>>>
>>>
>>>
>>> On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen <ho...@hoodel.com>wrote:
>>>
>>>>
>>>> SELINUX disabled? IPTABLES configured? I have nothing else.
>>>>
>>>> Kurt
>>>>
>>>> ------
>>>>
>>>>
>>>> On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>>>
>>>>> I configured a new instance with a master and a slave tserver.  When I
>>>>> do start-all on the master, the slave doesn't come up.  I am wondering if
>>>>> it's because I left the instance secret as the default. (I get an exception
>>>>> when I try to change that).
>>>>>
>>>>> This is what I see in the master's monitor regarding the slave
>>>>>
>>>>>     Non-Functioning Tablet Servers
>>>>>     The following tablet servers reported a status other than Online
>>>>>
>>>>> 10.240.203.36:9997 <http://10.240.203.36:9997>  UNRESPONSIVE
>>>>>
>>>>>
>>>>>
>>>>> In the master log I see the following
>>>>>
>>>>>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>>>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>>>     org.apache.thrift.transport.TTransportException:
>>>>>     java.net.NoRouteToHostException: No route to host
>>>>>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>>>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>>>     org.apache.thrift.transport.TTransportException:
>>>>>     java.net.NoRouteToHostException: No route to host
>>>>>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>>>>     class
>>>>>     org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for
>>>>>     table !0
>>>>>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>>>>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>>>>     table state for store Root Tablet
>>>>>     org.apache.thrift.transport.TTransportException:
>>>>>     java.net.NoRouteToHostException: No route to host
>>>>>             at
>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>> createNewTransport(ThriftTransportPool.java:475)
>>>>>             at
>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>> getTransport(ThriftTransportPool.java:464)
>>>>>             at
>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>> getTransport(ThriftTransportPool.java:441)
>>>>>             at
>>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>>> getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>>>>
>>>>>
>>>>>
>>>>> In the slave's tserver.log all I see is
>>>>>
>>>>>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>>>>     tablet server lock (reason = LOCK_DELETED), exiting.
>>>>>
>>>>>
>>>> --
>>>>
>>>> Kurt Christensen
>>>> P.O. Box 811
>>>> Westminster, MD 21158-0811
>>>>
>>>> ------------------------------------------------------------
>>>> ------------
>>>> If you can't explain it simply, you don't understand it well enough. --
>>>> Albert Einstein
>>>>
>>>
>>>
>>
>
>
> --
> Sean
>

Re: slave tserver not responding

Posted by Sean Busbey <bu...@clouderagovt.com>.
Well, I can tell you the proximal cause.  The tserver log shows that it
starts normally, then exits because it's told to (via the ZooKeeper lock
being removed).

If you look at the master debug logs, this happens because the master fails
in three attempts to talk to the tserver, all with the same error:

2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get tablet server
status 10.240.203.36:9997[1434c70ed30001b]
org.apache.thrift.transport.TTransportException:
java.net.NoRouteToHostException: No route to host

Unfortunately, this is the same error you noticed in your first email.
After three of those, the master deletes the ZooKeeper lock so that the
tserver will shut down.
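
To see how many of these failures the master logged for each tserver, a
rough tally along the following lines may help.  It is a sketch, not an
Accumulo tool: count_status_failures is a made-up name, the log path in the
example is a placeholder, and it assumes each error sits on a single line in
the log file (the excerpts in this thread were wrapped by the mail client).

```sh
# Tally "unable to get tablet server status" errors per tserver address.
# count_status_failures is a hypothetical helper name.
count_status_failures() {
    # $1: path to a master log file
    grep 'unable to get tablet server status' "$1" |
        # keep only the host:port that follows the message text
        sed -E 's/.*unable to get tablet server status ([^[ ]+).*/\1/' |
        sort | uniq -c |
        # uniq -c prints "count address"; flip it to "address count"
        awk '{ print $2, $1 }'
}

# Example (placeholder path, adjust to your install):
#   count_status_failures /path/to/master.debug.log
```

Each output line is an address followed by the number of matching errors.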

Could there be another firewall blocking access to port 9997 on the worker
machine from the master machine?

Check from the master (you'll need netcat):

$ nc -z 10.240.203.36 9997
$ echo $?





On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan <ar...@gmail.com> wrote:

> I am probably missing something really basic so I posted both the master
> and the slave log files:
>
> https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>
> Thanks again to everyone for the help!
>
>
> On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan <ar...@gmail.com>wrote:
>
>> disabled selinux (iptables already off) on both master and slave but
>> didn't make a difference unfortunately.
>>
>>
>>
>> On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen <ho...@hoodel.com>wrote:
>>
>>>
>>> SELINUX disabled? IPTABLES configured? I have nothing else.
>>>
>>> Kurt
>>>
>>> ------
>>>
>>>
>>> On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>>
>>>> I configured a new instance with a master and a slave tserver.  When I
>>>> do start-all on the master, the slave doesn't come up.  I am wondering if
>>>> it's because I left the instance secret as the default. (I get an exception
>>>> when I try to change that).
>>>>
>>>> This is what I see in the master's monitor regarding the slave
>>>>
>>>>     Non-Functioning Tablet Servers
>>>>     The following tablet servers reported a status other than Online
>>>>
>>>> 10.240.203.36:9997 <http://10.240.203.36:9997>  UNRESPONSIVE
>>>>
>>>>
>>>>
>>>> In the master log I see the following
>>>>
>>>>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>>     org.apache.thrift.transport.TTransportException:
>>>>     java.net.NoRouteToHostException: No route to host
>>>>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>>     org.apache.thrift.transport.TTransportException:
>>>>     java.net.NoRouteToHostException: No route to host
>>>>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>>>     class
>>>>     org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for
>>>>     table !0
>>>>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>>>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>>>     table state for store Root Tablet
>>>>     org.apache.thrift.transport.TTransportException:
>>>>     java.net.NoRouteToHostException: No route to host
>>>>             at
>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>> createNewTransport(ThriftTransportPool.java:475)
>>>>             at
>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>> getTransport(ThriftTransportPool.java:464)
>>>>             at
>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>> getTransport(ThriftTransportPool.java:441)
>>>>             at
>>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>>> getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>>>
>>>>
>>>>
>>>> In the slave's tserver.log all I see is
>>>>
>>>>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>>>     tablet server lock (reason = LOCK_DELETED), exiting.
>>>>
>>>>
>>> --
>>>
>>> Kurt Christensen
>>> P.O. Box 811
>>> Westminster, MD 21158-0811
>>>
>>> ------------------------------------------------------------------------
>>> If you can't explain it simply, you don't understand it well enough. --
>>> Albert Einstein
>>>
>>
>>
>


-- 
Sean

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
I am probably missing something really basic so I posted both the master
and the slave log files:

https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i

Thanks again to everyone for the help!


On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan <ar...@gmail.com>wrote:

> disabled selinux (iptables already off) on both master and slave but
> didn't make a difference unfortunately.
>
>
>
> On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen <ho...@hoodel.com>wrote:
>
>>
>> SELINUX disabled? IPTABLES configured? I have nothing else.
>>
>> Kurt
>>
>> ------
>>
>>
>> On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>
>>> I configured a new instance with a master and a slave tserver.  When I
>>> do start-all on the master, the slave doesn't come up.  I am wondering if
>>> it's because I left the instance secret as the default. (I get an exception
>>> when I try to change that).
>>>
>>> This is what I see in the master's monitor regarding the slave
>>>
>>>     Non-Functioning Tablet Servers
>>>     The following tablet servers reported a status other than Online
>>>
>>> 10.240.203.36:9997 <http://10.240.203.36:9997>  UNRESPONSIVE
>>>
>>>
>>>
>>> In the master log I see the following
>>>
>>>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>     org.apache.thrift.transport.TTransportException:
>>>     java.net.NoRouteToHostException: No route to host
>>>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>     org.apache.thrift.transport.TTransportException:
>>>     java.net.NoRouteToHostException: No route to host
>>>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>>     class
>>>     org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for
>>>     table !0
>>>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>>     table state for store Root Tablet
>>>     org.apache.thrift.transport.TTransportException:
>>>     java.net.NoRouteToHostException: No route to host
>>>             at
>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>> createNewTransport(ThriftTransportPool.java:475)
>>>             at
>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>> getTransport(ThriftTransportPool.java:464)
>>>             at
>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>> getTransport(ThriftTransportPool.java:441)
>>>             at
>>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>>> getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>>
>>>
>>>
>>> In the slave's tserver.log all I see is
>>>
>>>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>>     tablet server lock (reason = LOCK_DELETED), exiting.
>>>
>>>
>> --
>>
>> Kurt Christensen
>> P.O. Box 811
>> Westminster, MD 21158-0811
>>
>> ------------------------------------------------------------------------
>> If you can't explain it simply, you don't understand it well enough. --
>> Albert Einstein
>>
>
>

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Disabled SELinux (iptables was already off) on both master and slave, but
it didn't make a difference, unfortunately.



On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen <ho...@hoodel.com> wrote:

>
> SELINUX disabled? IPTABLES configured? I have nothing else.
>
> Kurt
>
> ------
>
>
> On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>
>> I configured a new instance with a master and a slave tserver.  When I do
>> start-all on the master, the slave doesn't come up.  I am wondering if it's
>> because I left the instance secret as the default. (I get an exception when
>> I try to change that).
>>
>> This is what I see in the master's monitor regarding the slave
>>
>>     Non-Functioning Tablet Servers
>>     The following tablet servers reported a status other than Online
>>
>> 10.240.203.36:9997 <http://10.240.203.36:9997>  UNRESPONSIVE
>>
>>
>>
>> In the master log I see the following
>>
>>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>     org.apache.thrift.transport.TTransportException:
>>     java.net.NoRouteToHostException: No route to host
>>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>     org.apache.thrift.transport.TTransportException:
>>     java.net.NoRouteToHostException: No route to host
>>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>     class
>>     org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for
>>     table !0
>>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>     table state for store Root Tablet
>>     org.apache.thrift.transport.TTransportException:
>>     java.net.NoRouteToHostException: No route to host
>>             at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> createNewTransport(ThriftTransportPool.java:475)
>>             at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> getTransport(ThriftTransportPool.java:464)
>>             at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> getTransport(ThriftTransportPool.java:441)
>>             at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>
>>
>>
>> In the slave's tserver.log all I see is
>>
>>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>     tablet server lock (reason = LOCK_DELETED), exiting.
>>
>>
> --
>
> Kurt Christensen
> P.O. Box 811
> Westminster, MD 21158-0811
>
> ------------------------------------------------------------------------
> If you can't explain it simply, you don't understand it well enough. --
> Albert Einstein
>

Re: slave tserver not responding

Posted by Kurt Christensen <ho...@hoodel.com>.
SELINUX disabled? IPTABLES configured? I have nothing else.

Kurt

------

On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
> I configured a new instance with a master and a slave tserver.  When I 
> do start-all on the master, the slave doesn't come up.  I am wondering 
> if it's because I left the instance secret as the default. (I get an 
> exception when I try to change that).
>
> This is what I see in the master's monitor regarding the slave
>
>     Non-Functioning Tablet Servers
>     The following tablet servers reported a status other than Online
>
> 10.240.203.36:9997 <http://10.240.203.36:9997> 	UNRESPONSIVE
>
>
> In the master log I see the following
>
>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get
>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>     org.apache.thrift.transport.TTransportException:
>     java.net.NoRouteToHostException: No route to host
>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get
>     tablet server status 10.240.203.36:9997[1434a79d34404a2]
>     org.apache.thrift.transport.TTransportException:
>     java.net.NoRouteToHostException: No route to host
>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>     class
>     org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for
>     table !0
>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>     table state for store Root Tablet
>     org.apache.thrift.transport.TTransportException:
>     java.net.NoRouteToHostException: No route to host
>             at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>             at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>             at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>             at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>
>
>
> In the slave's tserver.log all I see is
>
>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>     tablet server lock (reason = LOCK_DELETED), exiting.
>

-- 

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
If you can't explain it simply, you don't understand it well enough. -- 
Albert Einstein

Re: slave tserver not responding

Posted by Sean Busbey <bu...@clouderagovt.com>.
What does the other /etc/hosts file look like?
On Dec 31, 2013 9:14 PM, "Arshak Navruzyan" <ar...@gmail.com> wrote:

> Josh,
>
> Yea Zookeeper is running on the master and I can connect to it using zkCli
> from the slave.
>
> /etc/hosts looks fine
>
> 127.0.0.1   localhost localhost.localdomain localhost4
> localhost4.localdomain4
> ::1         localhost localhost.localdomain localhost6
> localhost6.localdomain6
> 10.240.203.36 shoki.c.accumulo-test.internal shoki  # Added by Google
>
> Hmm, completely baffled!
>
> Arshak
>
>
> On Tue, Dec 31, 2013 at 6:35 PM, Josh Elser <jo...@gmail.com> wrote:
>
>> On 12/31/13, 6:37 PM, Arshak Navruzyan wrote:
>>
>>> Here is my route -n
>>>
>>> Kernel IP routing table
>>> Destination     Gateway         Genmask         Flags Metric Ref    Use
>>> Iface
>>> 10.240.0.1      0.0.0.0         255.255.255.255 UH    0      0        0
>>> eth0
>>> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0
>>> eth0
>>> 0.0.0.0         10.240.0.1      0.0.0.0         UG    0      0        0
>>> eth0
>>>
>>>
>>> "slave tserver" is another physical machine (well google compute engine
>>> instance).  Yes one gce instance is running master (and slave) and the
>>> other is running just slave.
>>>
>>> here is my config:
>>>
>>> masters:
>>> 10.240.165.43
>>>
>>> slaves:
>>> 10.240.165.43
>>> 10.240.203.36
>>>
>>> BTW when I run bin/check-slaves conf/slaves
>>> # WRITABLE value not configured, not checking partitions
>>> 10.240.165.43
>>> 10.240.203.36
>>>
>>> Is the master supposed to be listed in the slaves files too?
>>>
>>
>> No, your configuration files look correct.
>>
>> I'm not sure why but for whatever reason, your slave (10.240.203.36)
>> can't talk back to the master (10.240.165.43), but at least that's where
>> you want to look at things. You know that the master can talk to the slave
>> (otherwise the slave tserver would have never started) and that the slave
>> tserver can talk to ZooKeeper (that it had and then lost a lock in ZK). Are
>> you running ZooKeeper on the master (that would further isolate it in
>> debugging this).
>>
>> It may be worthwhile to double check your /etc/hosts entries just to be
>> safe. Aside from that, I can't think of anything else at the moment.
>>
>>
>>> On Tue, Dec 31, 2013 at 3:32 PM, Josh Elser <josh.elser@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Maybe check the output of `route -n` on the master? It might be
>>>     something weird with DNS as well.
>>>
>>>     When you say "slave tserver", are you talking about a separate
>>>     physical machine? You have one node running the Accumulo master and
>>>     another running a tserver?
>>>
>>>
>>>     On 12/31/13, 6:02 PM, Arshak Navruzyan wrote:
>>>
>>>         I configured a new instance with a master and a slave tserver.
>>>           When I
>>>         do start-all on the master, the slave doesn't come up.  I am
>>>         wondering
>>>         if it's because I left the instance secret as the default. (I
>>> get an
>>>         exception when I try to change that).
>>>
>>>         This is what I see in the master's monitor regarding the slave
>>>
>>>              Non-Functioning Tablet Servers
>>>              The following tablet servers reported a status other than
>>>         Online
>>>
>>>         10.240.203.36:9997 <http://10.240.203.36:9997>  UNRESPONSIVE
>>>
>>>
>>>
>>>         In the master log I see the following
>>>
>>>              2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>>              2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>>              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>>              2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>>>              2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>>              2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
>>>              org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>>                       at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>>
>>>
>>>
>>>
>>>         In the slave's tserver.log all I see is
>>>
>>>              2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL:
>>> Lost
>>>              tablet server lock (reason = LOCK_DELETED), exiting.
>>>
>>>
>>>
>

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Josh,

Yes, ZooKeeper is running on the master, and I can connect to it using
zkCli from the slave.

/etc/hosts looks fine

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.240.203.36 shoki.c.accumulo-test.internal shoki  # Added by Google

Hmm, completely baffled!
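
One thing that can be ruled out mechanically with a hosts file like the one
above is the machine's own name mapping to a loopback address, which is a
common reason for a server to advertise an address that other nodes cannot
reach.  The helper below is only a sketch; hosts_lookup is a made-up name,
and it reads the file directly rather than going through the system
resolver, so DNS is not consulted.

```sh
# Print every address that a hosts-format file maps to a given name.
# hosts_lookup is a hypothetical helper name.
hosts_lookup() {
    # $1: hosts file, $2: host name or alias to look for
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments such as "# Added by Google"
        {
            # field 1 is the address; the remaining fields are names
            for (i = 2; i <= NF; i++)
                if ($i == name) print $1
        }
    ' "$1"
}

# Example, run on each node:
#   hosts_lookup /etc/hosts "$(hostname)"
```

With the file shown above, looking up shoki would print 10.240.203.36 and
nothing on the loopback range, which is consistent with "looks fine".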

Arshak


On Tue, Dec 31, 2013 at 6:35 PM, Josh Elser <jo...@gmail.com> wrote:

> On 12/31/13, 6:37 PM, Arshak Navruzyan wrote:
>
>> Here is my route -n
>>
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags Metric Ref    Use
>> Iface
>> 10.240.0.1      0.0.0.0         255.255.255.255 UH    0      0        0
>> eth0
>> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0
>> eth0
>> 0.0.0.0         10.240.0.1      0.0.0.0         UG    0      0        0
>> eth0
>>
>>
>> "slave tserver" is another physical machine (well google compute engine
>> instance).  Yes one gce instance is running master (and slave) and the
>> other is running just slave.
>>
>> here is my config:
>>
>> masters:
>> 10.240.165.43
>>
>> slaves:
>> 10.240.165.43
>> 10.240.203.36
>>
>> BTW when I run bin/check-slaves conf/slaves
>> # WRITABLE value not configured, not checking partitions
>> 10.240.165.43
>> 10.240.203.36
>>
>> Is the master supposed to be listed in the slaves files too?
>>
>
> No, your configuration files look correct.
>
> I'm not sure why but for whatever reason, your slave (10.240.203.36) can't
> talk back to the master (10.240.165.43), but at least that's where you want
> to look at things. You know that the master can talk to the slave
> (otherwise the slave tserver would have never started) and that the slave
> tserver can talk to ZooKeeper (that it had and then lost a lock in ZK). Are
> you running ZooKeeper on the master (that would further isolate it in
> debugging this).
>
> It may be worthwhile to double check your /etc/hosts entries just to be
> safe. Aside from that, I can't think of anything else at the moment.
>
>
>> On Tue, Dec 31, 2013 at 3:32 PM, Josh Elser <jo...@gmail.com> wrote:
>>
>>     Maybe check the output of `route -n` on the master? It might be
>>     something weird with DNS as well.
>>
>>     When you say "slave tserver", are you talking about a separate
>>     physical machine? You have one node running the Accumulo master and
>>     another running a tserver?
>>

Re: slave tserver not responding

Posted by Josh Elser <jo...@gmail.com>.
On 12/31/13, 6:37 PM, Arshak Navruzyan wrote:
> Here is my route -n
>
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 10.240.0.1      0.0.0.0         255.255.255.255 UH    0      0        0 eth0
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
> 0.0.0.0         10.240.0.1      0.0.0.0         UG    0      0        0 eth0
>
>
> "slave tserver" is another physical machine (well google compute engine
> instance).  Yes one gce instance is running master (and slave) and the
> other is running just slave.
>
> here is my config:
>
> masters:
> 10.240.165.43
>
> slaves:
> 10.240.165.43
> 10.240.203.36
>
> BTW when I run bin/check-slaves conf/slaves
> # WRITABLE value not configured, not checking partitions
> 10.240.165.43
> 10.240.203.36
>
> Is the master supposed to be listed in the slaves files too?

No, your configuration files look correct.

I'm not sure why, but for whatever reason your slave (10.240.203.36) 
can't talk back to the master (10.240.165.43); at least, that's where 
you want to look. You know that the master can talk to the slave 
(otherwise the slave tserver would never have started) and that the 
slave tserver can talk to ZooKeeper (it had, and then lost, a lock in 
ZK). Are you running ZooKeeper on the master? That would further 
isolate the problem while debugging this.

It may be worthwhile to double-check your /etc/hosts entries just to be 
safe. Aside from that, I can't think of anything else at the moment.
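
To test each direction separately, a plain TCP connect from one node to the other is usually enough. This is a minimal sketch, not an Accumulo tool: `can_connect` is a made-up helper, 9997 is the tserver port shown in the monitor output, and any other port (for example ZooKeeper's) would have to be filled in from your own configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection and return (ok, detail) instead of raising."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts and "No route to host".
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Run this on the master to probe the slave's tserver port.
    # Swap in the master's address and a listening port when running on the slave.
    ok, detail = can_connect("10.240.203.36", 9997)
    print("10.240.203.36:9997 ->", "OK" if ok else "FAILED", detail)
```

If this fails with "No route to host" while the routing table looks sane, note that a firewall rejecting the connection can also produce that error, so the host and cloud firewall rules are worth a look.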

>
> On Tue, Dec 31, 2013 at 3:32 PM, Josh Elser <jo...@gmail.com> wrote:
>
>     Maybe check the output of `route -n` on the master? It might be
>     something weird with DNS as well.
>
>     When you say "slave tserver", are you talking about a separate
>     physical machine? You have one node running the Accumulo master and
>     another running a tserver?
>

Re: slave tserver not responding

Posted by Arshak Navruzyan <ar...@gmail.com>.
Here is my `route -n` output:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.240.0.1      0.0.0.0         255.255.255.255 UH    0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
0.0.0.0         10.240.0.1      0.0.0.0         UG    0      0        0 eth0
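
For reference, the table above can be read with a longest-prefix match, which is how the kernel chooses among matching routes. The sketch below hard-codes the three rows shown and uses a made-up helper, `pick_route`; it ignores metrics and only illustrates which row applies to the other node's address.

```python
import ipaddress

# (destination, genmask, gateway, iface), copied from the route -n output above.
ROUTES = [
    ("10.240.0.1", "255.255.255.255", "0.0.0.0", "eth0"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "eth0"),
    ("0.0.0.0", "0.0.0.0", "10.240.0.1", "eth0"),
]


def pick_route(dest_ip, routes):
    """Return (network, gateway, iface) of the most specific matching route."""
    addr = ipaddress.ip_address(dest_ip)
    best = None
    for dest, mask, gateway, iface in routes:
        net = ipaddress.ip_network(f"{dest}/{mask}", strict=False)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, iface)
    return best


if __name__ == "__main__":
    for target in ("10.240.203.36", "10.240.165.43"):
        net, gateway, iface = pick_route(target, ROUTES)
        print(f"{target}: via {net} gateway {gateway} dev {iface}")
```

Both node addresses fall through to the default route via 10.240.0.1, so the table itself does have a route for them; that suggests the "No route to host" error may be coming from somewhere other than a missing routing entry.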


"slave tserver" is another physical machine (well google compute engine
instance).  Yes, one GCE instance is running the master (and a slave), and the
other is running just a slave.

Here is my config:

masters:
10.240.165.43

slaves:
10.240.165.43
10.240.203.36

BTW, when I run bin/check-slaves conf/slaves, I get:
# WRITABLE value not configured, not checking partitions
10.240.165.43
10.240.203.36

Is the master supposed to be listed in the slaves file too?


On Tue, Dec 31, 2013 at 3:32 PM, Josh Elser <jo...@gmail.com> wrote:

> Maybe check the output of `route -n` on the master? It might be something
> weird with DNS as well.
>
> When you say "slave tserver", are you talking about a separate physical
> machine? You have one node running the Accumulo master and another running
> a tserver?
>
>
> On 12/31/13, 6:02 PM, Arshak Navruzyan wrote:
>
>> I configured a new instance with a master and a slave tserver.  When I
>> do start-all on the master, the slave doesn't come up.  I am wondering
>> if it's because I left the instance secret as the default. (I get an
>> exception when I try to change that).
>>
>> This is what I see in the master's monitor regarding the slave
>>
>>     Non-Functioning Tablet Servers
>>     The following tablet servers reported a status other than Online
>>
>> 10.240.203.36:9997  UNRESPONSIVE
>>
>>
>>
>> In the master log I see the following
>>
>>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet
>>     server status 10.240.203.36:9997[1434a79d34404a2]
>>     org.apache.thrift.transport.TTransportException:
>>     java.net.NoRouteToHostException: No route to host
>>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet
>>     server status 10.240.203.36:9997[1434a79d34404a2]
>>     org.apache.thrift.transport.TTransportException:
>>     java.net.NoRouteToHostException: No route to host
>>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>>     class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer
>>     for table !0
>>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>>     table state for store Root Tablet
>>     org.apache.thrift.transport.TTransportException:
>>     java.net.NoRouteToHostException: No route to host
>>              at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> createNewTransport(ThriftTransportPool.java:475)
>>              at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> getTransport(ThriftTransportPool.java:464)
>>              at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> getTransport(ThriftTransportPool.java:441)
>>              at
>>     org.apache.accumulo.core.client.impl.ThriftTransportPool.
>> getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>
>>
>>
>> In the slave's tserver.log all I see is
>>
>>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>>     tablet server lock (reason = LOCK_DELETED), exiting.
>>
>>

Re: slave tserver not responding

Posted by Josh Elser <jo...@gmail.com>.
Maybe check the output of `route -n` on the master? It might be 
something weird with DNS as well.

When you say "slave tserver", are you talking about a separate physical 
machine? You have one node running the Accumulo master and another 
running a tserver?
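
For the DNS side, a quick check is to see what each name or address resolves to on each node. This is a rough sketch with a made-up helper, `resolve`; the names in the loop are placeholders to be replaced with whatever appears in your conf/masters and conf/slaves files.

```python
import socket


def resolve(name):
    """Return the sorted list of addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Placeholder names: replace with the hosts listed in your own config files.
    for name in ("localhost", "10.240.203.36"):
        print(name, "->", resolve(name) or "does not resolve")
```

Running it on both machines and comparing the results shows whether the two nodes agree on each other's addresses.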

On 12/31/13, 6:02 PM, Arshak Navruzyan wrote:
> I configured a new instance with a master and a slave tserver.  When I
> do start-all on the master, the slave doesn't come up.  I am wondering
> if it's because I left the instance secret as the default. (I get an
> exception when I try to change that).
>
> This is what I see in the master's monitor regarding the slave
>
>     Non-Functioning Tablet Servers
>     The following tablet servers reported a status other than Online
>
> 10.240.203.36:9997  UNRESPONSIVE
>
>
> In the master log I see the following
>
>     2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet
>     server status 10.240.203.36:9997[1434a79d34404a2]
>     org.apache.thrift.transport.TTransportException:
>     java.net.NoRouteToHostException: No route to host
>     2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet
>     server status 10.240.203.36:9997[1434a79d34404a2]
>     org.apache.thrift.transport.TTransportException:
>     java.net.NoRouteToHostException: No route to host
>     2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded
>     class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer
>     for table !0
>     2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>     2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing
>     table state for store Root Tablet
>     org.apache.thrift.transport.TTransportException:
>     java.net.NoRouteToHostException: No route to host
>              at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>              at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>              at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>              at
>     org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>
>
>
> In the slave's tserver.log all I see is
>
>     2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost
>     tablet server lock (reason = LOCK_DELETED), exiting.
>