Posted to user@hbase.apache.org by Kristoffer Sjögren <st...@gmail.com> on 2015/11/24 09:55:25 UTC

Phantom region server and PENDING_OPEN regions

Hi

I'm trying to install an HBase cluster with one master
(amb1.service.consul) and one region server (amb2.service.consul) using
Ambari on Docker containers provided by sequenceiq [1], with a custom
blueprint [2].

Every component installs correctly except HBase, which gets stuck
with regions in transition:

---
hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24 08:26:45
UTC 2015 (1098s ago), server=amb2.service.consul,16020,1448353564099
---

For some reason the master discovers two region servers (instead of one)
with exactly the same start timestamp but different hostnames.
I'm not sure whether this is why the regions get stuck.

----
amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
----
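HBase identifies a region server by a ServerName string of the form host,port,startcode, where the start code is the server's start time in epoch milliseconds. Both rows above share port 16020 and start code 1448353564099 (which works out to Tue Nov 24 08:26:04 UTC 2015, the start time shown), so they very likely describe one process listed under two hostnames. A minimal Python sketch for spotting such aliases in a list copied from the master UI; parse_server_name and find_aliases are hypothetical helpers, not part of HBase:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class ServerName(NamedTuple):
    host: str
    port: int
    startcode: int  # region server start time, epoch milliseconds

    @property
    def started_at(self) -> datetime:
        # timedelta keeps the milliseconds exact (no float rounding).
        return EPOCH + timedelta(milliseconds=self.startcode)


def parse_server_name(text: str) -> ServerName:
    """Split 'host,port,startcode' into its three fields."""
    host, port, startcode = text.strip().split(",")
    return ServerName(host, int(port), int(startcode))


def find_aliases(names: list[str]) -> dict[tuple[int, int], list[str]]:
    """Group entries by (port, startcode) and keep groups with >1 host."""
    groups: dict[tuple[int, int], list[str]] = defaultdict(list)
    for raw in names:
        sn = parse_server_name(raw)
        groups[(sn.port, sn.startcode)].append(sn.host)
    return {key: hosts for key, hosts in groups.items() if len(hosts) > 1}


listed = [
    "amb2.node.dc1.consul,16020,1448353564099",
    "amb2.service.consul,16020,1448353564099",
]
print(find_aliases(listed))
print(parse_server_name(listed[0]).started_at)
```

Grouping on port and start code rather than hostname is deliberate: the hostname is exactly the field that name resolution can change.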

The only place I can find "node.dc1.consul" on the Ambari agent/server
hosts is /etc/resolv.conf, which looks like this:

----
nameserver 172.17.0.82
search service.consul node.dc1.consul
----
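A search line tells the stub resolver how to expand short names for DNS queries. The Python sketch below models the usual ordering under the default ndots:1: a name with fewer dots than ndots is tried with each search suffix first and as-is last, otherwise the other way round. It is a simplified model (no resolver options other than ndots, and /etc/hosts, which nsswitch.conf usually consults before DNS, is not modeled) and it performs no lookups. The search list only shapes forward lookups of short names; the name a reverse lookup returns comes from the nameserver itself, so changing the search line alone does not change that.

```python
def candidate_names(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Order in which a stub resolver would try `name` (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks a fully qualified name: no search expansion.
        return [name.rstrip(".")]
    suffixed = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + suffixed
    # Too few dots: try the search list first, the bare name last.
    return suffixed + [name]


search = ["service.consul", "node.dc1.consul"]
print(candidate_names("amb2", search))
# ['amb2.service.consul', 'amb2.node.dc1.consul', 'amb2']
```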

Is there some way that I can manually tell the master to disregard the
"phantom" host amb2.node.dc1.consul?

Any help or tips appreciated.

Cheers,
-Kristoffer


[1] https://github.com/sequenceiq/docker-ambari
[2] https://gist.githubusercontent.com/krisskross/901ed8223c1ed1db80e3/raw/869327be9ad15e6a9f099a7591323244cd245357/ambari-hdp2.3

Re: Phantom region server and PENDING_OPEN regions

Posted by Kristoffer Sjögren <st...@gmail.com>.
Nope, $HOSTNAME is *.service.consul. But I actually got it working now.

The problem was that the sequenceiq Ambari setup installed a Consul Docker
instance with a nameserver that answered with the wrong hostname.
When I removed that nameserver from /etc/resolv.conf and instead added
the correct *.service.consul hostnames to /etc/hosts, everything
worked! :-)

Thanks for your help, much appreciated!
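One plausible reading of why /etc/hosts fixed it: when nsswitch.conf lists files before dns (the common "hosts: files dns" setup; check yours), both the forward lookup of a name and the reverse lookup of its address are answered from /etc/hosts, and a reverse lookup returns the first name on the matching line. The Python sketch below models just that lookup on a hosts-file string. It is an exact-match simplification (real lookups are case-insensitive), it does not call the system resolver, and the helper names are hypothetical. The sample entry is the amb2 line from this thread.

```python
from typing import Optional

HostsEntries = list[tuple[str, list[str]]]


def parse_hosts(text: str) -> HostsEntries:
    """Return (address, [canonical_name, *aliases]) for each hosts line."""
    entries: HostsEntries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if names:
            entries.append((address, names))
    return entries


def forward(entries: HostsEntries, name: str) -> Optional[str]:
    # The first line that lists the name, as canonical name or alias, wins.
    for address, names in entries:
        if name in names:
            return address
    return None


def reverse(entries: HostsEntries, address: str) -> Optional[str]:
    # A reverse lookup answers with the canonical (first) name on the line.
    for addr, names in entries:
        if addr == address:
            return names[0]
    return None


hosts = """\
172.17.0.90 amb2.service.consul amb2
127.0.0.1 localhost
"""
entries = parse_hosts(hosts)
print(forward(entries, "amb2"))         # 172.17.0.90
print(reverse(entries, "172.17.0.90"))  # amb2.service.consul
```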

On Tue, Nov 24, 2015 at 1:21 PM, Samir Ahmic <ah...@gmail.com> wrote:
> Your hosts file looks fine. If I understand correctly, the value of the
> $HOSTNAME env variable is *.node.dc1.consul? Try changing the servers'
> hostnames to *.service.consul.
> Also try disabling resolution by the DNS server: comment out all lines in
> /etc/resolv.conf.
>
> Regards
> Samir
>
> On Tue, Nov 24, 2015 at 12:29 PM, Kristoffer Sjögren <st...@gmail.com>
> wrote:
>
>> There is only one network interface on each machine. The ping is
>> interesting: pinged from the other host, each machine answers as
>> *.node.dc1.consul, but pinged from itself it answers as *.service.consul.
>>
>> amb1.service.consul /etc/hosts
>> 172.17.0.89 amb1.service.consul amb1
>> 127.0.0.1 localhost
>> ::1 localhost ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>>
>> amb2.service.consul /etc/hosts
>> 172.17.0.90 amb2.service.consul amb2
>> 127.0.0.1 localhost
>> ::1 localhost ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>>
>>
>> ping amb1 from amb1.service.consul
>>
>> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
>> 64 bytes from amb1.service.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.059 ms
>>
>> ping amb2 from amb1.service.consul
>>
>> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
>> 64 bytes from amb2.node.dc1.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.069 ms
>>
>> ping amb1 from amb2.service.consul
>>
>> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
>> 64 bytes from amb1.node.dc1.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.070 ms
>>
>> ping amb2 from amb2.service.consul
>>
>> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
>> 64 bytes from amb2.service.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.054 ms
>>
>> On Tue, Nov 24, 2015 at 11:58 AM, Samir Ahmic <ah...@gmail.com>
>> wrote:
>> > As I can see from the logs, you also have an issue connecting to ZK.
>> > The configuration points to the correct server, but name resolution
>> > produces wrong values. Do you have multiple network interfaces on the
>> > servers? What does ping $HOSTNAME return? What do you have in your
>> > /etc/hosts file? Do you have a local nameserver running on the servers?
>> >
>> > Regards
>> > Samir
>> > On Nov 24, 2015 11:21 AM, "Kristoffer Sjögren" <st...@gmail.com> wrote:
>> >
>> >> The logs on the region server [1] are also quite interesting.
>> >>
>> >> Before I restarted the cluster, the region server complained that
>> >> amb2.node.dc1.consul had hijacked the regions from
>> >> amb2.service.consul.
>> >>
>> >> 2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
>> >> was amb2.node.dc1.consul,16020,1448353564099 not the expected
>> >> amb2.service.consul,16020,1448353564099
>> >> 2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
>> >> to OPENING for region=1588230740
>> >> 2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
>> >> encodedName=1588230740
>> >> 2015-11-24 08:26:45,100 INFO  [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
>> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
>> >> version 0
>> >> 2015-11-24 08:26:45,101 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
>> >> transition was amb2.node.dc1.consul,16020,1448353564099 not the
>> >> expected amb2.service.consul,16020,1448353564099
>> >>
>> >>
>> >> After editing resolv.conf and restarting the cluster, it still complains
>> >> about amb2.node.dc1.consul trying to transition the regions instead of
>> >> amb2.service.consul.
>> >>
>> >> 2015-11-24 09:32:26,334 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
>> >> was amb2.node.dc1.consul,16020,1448357534179 not the expected
>> >> amb2.service.consul,16020,1448357534179
>> >> 2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
>> >> to OPENING for region=1588230740
>> >> 2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
>> >> encodedName=1588230740
>> >> 2015-11-24 09:32:26,335 INFO  [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
>> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
>> >> version 2
>> >> 2015-11-24 09:32:26,336 WARN  [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
>> >> transition was amb2.node.dc1.consul,16020,1448357534179 not the
>> >> expected amb2.service.consul,16020,1448357534179
>> >>
>> >>
>> >> [1] http://pastebin.com/z93p8Mdu
>> >>
>> >> On Tue, Nov 24, 2015 at 10:48 AM, Kristoffer Sjögren <st...@gmail.com>
>> >> wrote:
>> >> > I removed node.dc1.consul from resolv.conf and restarted the
>> >> > cluster, but it still shows up in the master UI.
>> >> >
>> >> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> >> > amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
>> >> >
>> >> > The logs [1] report that the meta region fails to be assigned to
>> >> > amb2.node.dc1.consul; the master then tries to assign it to
>> >> > amb2.service.consul, where it gets stuck in PENDING_OPEN again.
>> >> >
>> >> > ---
>> >> > 1588230740  hbase:meta,,1.1588230740 state=PENDING_OPEN,
>> >> >   ts=Tue Nov 24 09:32:26 UTC 2015 (450s ago),
>> >> >   server=amb2.service.consul,16020,1448357534179  450511
>> >> > ---
>> >> >
>> >> > Before I restarted the cluster, the master log [2] complained about
>> >> > not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
>> >> >
>> >> > I'm not sure, but it feels as if amb2.node.dc1.consul somehow shadows
>> >> > the real host amb2.service.consul.
>> >> >
>> >> > I was looking through the source code and found the configuration
>> >> > property 'hbase.regionserver.hostname'. Could that help get rid of
>> >> > the node.dc1 host?
>> >> >
>> >> > [1] http://pastebin.com/uZKqK9BJ
>> >> > [2] http://pastebin.com/s10E2rtA
>> >> >
>> >> > On Tue, Nov 24, 2015 at 10:23 AM, Samir Ahmic <ah...@gmail.com>
>> >> > wrote:
>> >> >> Hi Kristoffer,
>> >> >> It looks like you have an issue with name resolution. Try removing the
>> >> >> incorrect value (node.dc1.consul) from resolv.conf and then restart the
>> >> >> HBase cluster.
>> >> >> Regarding the region in transition, check the master log for
>> >> >> "hbase:meta,,1.1588230740". There should be an exception explaining
>> >> >> why hbase:meta cannot transition from PENDING_OPEN to OPEN. If the
>> >> >> hbase:meta table is unavailable, the master cannot finish
>> >> >> initialization.
>> >> >>
>> >> >> Regards
>> >> >> Samir
>> >> >>
>> >> >> On Tue, Nov 24, 2015 at 10:11 AM, Kristoffer Sjögren <st...@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> Sorry, I should mention that this is HBase 1.1.2.
>> >> >>>
>> >> >>> ZooKeeper reports only one region server.
>> >> >>>
>> >> >>> $ ls /hbase-unsecure/rs
>> >> >>> [amb2.service.consul,16020,1448353564099]
>> >>
>>
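The ZKAssign warnings quoted above appear to come from a name comparison: the region server's own ServerName has to equal the one stored in the region's unassigned znode, and here only the hostname differs. A rough Python sketch that extracts both names from such a warning and reports which fields differ; the regular expression is written against the log text in this thread and is an assumption about the message format, not a stable HBase interface:

```python
import re
from typing import Optional

# Matches the tail of the ZKAssign warning; \s+ tolerates line wrapping.
WARN_RE = re.compile(
    r"the\s+server\s+that\s+tried\s+to\s+transition\s+was\s+(?P<actual>\S+)"
    r"\s+not\s+the\s+expected\s+(?P<expected>\S+)"
)
FIELDS = ("host", "port", "startcode")


def name_mismatch(log_text: str) -> Optional[dict[str, tuple[str, str]]]:
    """Map each differing ServerName field to (actual, expected)."""
    match = WARN_RE.search(log_text)
    if match is None:
        return None
    actual = match["actual"].split(",")
    expected = match["expected"].split(",")
    return {
        field: (a, e)
        for field, a, e in zip(FIELDS, actual, expected)
        if a != e
    }


warning = (
    "Attempt to transition the unassigned node for 1588230740 from "
    "M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that "
    "tried to transition was amb2.node.dc1.consul,16020,1448353564099 not "
    "the expected amb2.service.consul,16020,1448353564099"
)
print(name_mismatch(warning))
# {'host': ('amb2.node.dc1.consul', 'amb2.service.consul')}
```

Identical port and start code with a different host points at name resolution rather than at a second region server process.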

Re: Phantom region server and PENDING_OPEN regions

Posted by Samir Ahmic <ah...@gmail.com>.
Your hosts file looks fine. If I understand correctly, the value of the
$HOSTNAME env variable is *.node.dc1.consul? Try changing the servers'
hostnames to *.service.consul.
Also try disabling resolution by the DNS server: comment out all lines in
/etc/resolv.conf.

Regards
Samir
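An alternative to renaming the OS host is the 'hbase.regionserver.hostname' property raised earlier in the thread, which tells a region server which name to register under. In HBase 1.x the master may otherwise derive the region server's name from a reverse lookup of the connecting address, which would fit the node.dc1.consul entries; treat that as an assumption, and check hbase-default.xml for your release before relying on the property. It would be set in hbase-site.xml on each region server, with that host's own name. A small Python sketch that renders the property element (hostname_property is a hypothetical helper):

```python
import xml.etree.ElementTree as ET


def hostname_property(hostname: str) -> str:
    """Render an hbase-site.xml <property> pinning the region server name."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hbase.regionserver.hostname"
    ET.SubElement(prop, "value").text = hostname
    return ET.tostring(prop, encoding="unicode")


# On amb2 the value should be the name the rest of the cluster uses for it.
print(hostname_property("amb2.service.consul"))
# <property><name>hbase.regionserver.hostname</name><value>amb2.service.consul</value></property>
```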


Re: Phantom region server and PENDING_OPEN regions

Posted by Kristoffer Sjögren <st...@gmail.com>.
There is only one network interface on each machine. The ping is
interesting: pinged from the other host, each machine answers as
*.node.dc1.consul, but pinged from itself it answers as *.service.consul.

amb1.service.consul /etc/hosts
172.17.0.89 amb1.service.consul amb1
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

amb2.service.consul /etc/hosts
172.17.0.90 amb2.service.consul amb2
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters


ping amb1 from amb1.service.consul

PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
64 bytes from amb1.service.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.059 ms

ping amb2 from amb1.service.consul

PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
64 bytes from amb2.node.dc1.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.069 ms

ping amb1 from amb2.service.consul

PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
64 bytes from amb1.node.dc1.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.070 ms

ping amb2 from amb2.service.consul

PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
64 bytes from amb2.service.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.054 ms
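The ping output above shows the mismatch directly: the PING header echoes the name that was resolved forward, while the reply line shows the name a reverse (PTR) lookup of the replying address returned. A Python sketch of the same forward-then-reverse round trip; by default it uses socket.gethostbyname and socket.gethostbyaddr, and the resolver functions can be swapped for stand-in tables (the ones below simply reproduce what amb1 saw) so it runs without a live DNS. Pass a fully qualified name, since a short name like amb2 will never round-trip to itself.

```python
import socket
from typing import Callable


def _ptr_name(address: str) -> str:
    # gethostbyaddr returns (hostname, aliases, addresses).
    return socket.gethostbyaddr(address)[0]


def round_trip(
    name: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = _ptr_name,
) -> tuple[str, str, bool]:
    """Resolve name -> address -> name; report whether it comes back."""
    address = forward(name)
    back = reverse(address)
    # DNS names are case-insensitive; ignore a trailing root dot as well.
    consistent = back.rstrip(".").lower() == name.rstrip(".").lower()
    return address, back, consistent


# Stand-in tables reproducing what amb1 saw; omit them on a real host.
fake_a = {"amb2.service.consul": "172.17.0.90"}
fake_ptr = {"172.17.0.90": "amb2.node.dc1.consul"}
print(round_trip("amb2.service.consul", fake_a.__getitem__, fake_ptr.__getitem__))
# ('172.17.0.90', 'amb2.node.dc1.consul', False)
```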


Re: Phantom region server and PENDING_OPEN regions

Posted by Samir Ahmic <ah...@gmail.com>.
From the logs I can see that you also have an issue connecting to ZK.
The configuration points to the correct server, but name resolution
produces the wrong values. Do you have multiple network interfaces on the
servers? What does ping $HOSTNAME return? What is in your /etc/hosts
file? Do you have a local nameserver running on the servers?
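One rough way to answer the resolution questions above from inside each container is to compare the forward and reverse lookups for the local hostname. This is a minimal Python sketch, not something HBase ships; it only approximates what the JVM's resolver does, and the helper name resolution_report is made up for illustration. If the reverse name differs from $HOSTNAME (for example amb2.node.dc1.consul versus amb2.service.consul), the region server may end up reporting a different name than the one the master expects.

```python
import socket
from typing import Dict, Optional


def resolution_report(hostname: Optional[str] = None) -> Dict[str, str]:
    """Collect forward and reverse lookup results for a host.

    Hypothetical helper: defaults to this machine's own hostname and uses
    whatever resolver setup the OS applies (/etc/hosts, /etc/resolv.conf, ...).
    """
    name = hostname or socket.gethostname()
    report = {"hostname": name, "fqdn": socket.getfqdn(name)}
    try:
        ip = socket.gethostbyname(name)  # forward lookup (IPv4 only)
        report["ip"] = ip
        report["reverse"] = socket.gethostbyaddr(ip)[0]  # reverse lookup of that IP
    except OSError as exc:  # socket.gaierror and socket.herror both subclass OSError
        report["error"] = str(exc)
    return report
```

Calling resolution_report() with no argument on amb2, and resolution_report("amb2.service.consul") on amb1, should show whether both containers agree on a single name.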

Regards
Samir

Re: Phantom region server and PENDING_OPEN regions

Posted by Kristoffer Sjögren <st...@gmail.com>.
The logs on the region server [1] are also quite interesting.

Before I restarted the cluster, the region server complained that
amb2.node.dc1.consul had hijacked the regions from
amb2.service.consul.

2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition
was amb2.node.dc1.consul,16020,1448353564099 not the expected
amb2.service.consul,16020,1448353564099
2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
to OPENING for region=1588230740
2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
encodedName=1588230740
2015-11-24 08:26:45,100 INFO  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
version 0
2015-11-24 08:26:45,101 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
transition was amb2.node.dc1.consul,16020,1448353564099 not the
expected amb2.service.consul,16020,1448353564099
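The failing check can be modeled roughly like this. An HBase server name is the triple host,port,startcode, and the log says the server that tried the transition is not the one recorded for the region. This is a simplified, hypothetical model of that ownership comparison, not the actual ZKAssign code: because the whole triple is compared, an identical port and start code do not help when the host part differs.

```python
from typing import NamedTuple


class ServerName(NamedTuple):
    """Simplified model of an HBase server name: host,port,startcode."""
    host: str
    port: int
    startcode: int

    @classmethod
    def parse(cls, text: str) -> "ServerName":
        host, port, startcode = text.split(",")
        return cls(host, int(port), int(startcode))


def may_transition(expected_owner: str, actor: str) -> bool:
    """Hypothetical ownership check: only the exact expected server may move the region."""
    return ServerName.parse(expected_owner) == ServerName.parse(actor)


expected = "amb2.service.consul,16020,1448353564099"
actor = "amb2.node.dc1.consul,16020,1448353564099"
print(may_transition(expected, actor))  # False: same port and start code, different host
```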


After editing resolv.conf and restarting the cluster, it still complains
about amb2.node.dc1.consul trying to transition the regions instead of
amb2.service.consul.

2015-11-24 09:32:26,334 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition
was amb2.node.dc1.consul,16020,1448357534179 not the expected
amb2.service.consul,16020,1448357534179
2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
to OPENING for region=1588230740
2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
encodedName=1588230740
2015-11-24 09:32:26,335 INFO  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
version 2
2015-11-24 09:32:26,336 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
transition was amb2.node.dc1.consul,16020,1448357534179 not the
expected amb2.service.consul,16020,1448357534179


[1] http://pastebin.com/z93p8Mdu


Re: Phantom region server and PENDING_OPEN regions

Posted by Kristoffer Sjögren <st...@gmail.com>.
I removed node.dc1.consul from resolv.conf and restarted the
cluster, but it still shows up in the master UI.

amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
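Both entries share the port and the start code (the third field). The start code is the region server's start time in epoch milliseconds: 1448353564099 ms corresponds to 2015-11-24 08:26:04 UTC, the same time shown in the UI, which suggests one process listed under two hostnames. A small Python sketch (find_aliases and start_time are hypothetical helpers) that groups names this way:

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def find_aliases(server_names: List[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group host,port,startcode names by (port, startcode); keep groups with >1 host."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name in server_names:
        host, port, startcode = name.split(",")
        groups[(port, startcode)].append(host)
    return {key: hosts for key, hosts in groups.items() if len(hosts) > 1}


def start_time(startcode: str) -> datetime:
    """Convert a start code (epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(int(startcode) / 1000, tz=timezone.utc)


listed = [
    "amb2.node.dc1.consul,16020,1448353564099",
    "amb2.service.consul,16020,1448353564099",
]
for (port, code), hosts in find_aliases(listed).items():
    print(port, start_time(code).isoformat(), hosts)
```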

The logs [1] report that assigning the meta region to node.dc1.consul
fails, after which the master tries to assign it to amb2.service.consul,
where it gets stuck in PENDING_OPEN again.

---
1588230740  hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24
09:32:26 UTC 2015 (450s ago),
server=amb2.service.consul,16020,1448357534179  450511
---

Before I restarted the cluster, the master log [2] complained about
not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
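To check from the master container whether the region server's RPC port is reachable under each name, a plain TCP connect is enough. A minimal Python sketch, with can_connect as a hypothetical helper. Because the name is resolved by the OS resolver, the result also reflects the local /etc/hosts and resolv.conf setup, although the JVM may cache lookups differently.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves `host` first (the timeout does not cover
        # that lookup), then tries each returned address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # resolution failures, refused connections and timeouts
        return False
```

For example, run can_connect("amb2.node.dc1.consul", 16020) and can_connect("amb2.service.consul", 16020) on amb1 and compare.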

I'm not sure, but it feels as if amb2.node.dc1.consul somehow shadows
the real host amb2.service.consul.

I was looking through the source code and found the configuration
property 'hbase.regionserver.hostname'. Could that help here to get rid
of the node.dc1 host?
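As far as I know, hbase.regionserver.hostname is available from HBase 1.1 onward and makes a region server report the given name instead of the one it resolves. It is a per-server setting, so each region server needs its own value, and the HBase documentation marks it as expert-only. In an Ambari-managed cluster hbase-site.xml is, I believe, regenerated from Ambari's configuration, so the value would normally be set there rather than in the file. For a hand-managed file, here is a minimal Python sketch (set_property is a hypothetical helper) that adds or overrides the property:

```python
import xml.etree.ElementTree as ET


def set_property(site_xml: str, name: str, value: str) -> str:
    """Return hbase-site.xml text with property `name` set to `value`.

    Hypothetical helper. Updates an existing <property> with a matching <name>,
    otherwise appends a new one. ElementTree drops XML comments and the
    <?xml ...?> declaration, so keep a backup of the original file.
    """
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:  # tolerate a <property> that lacks <value>
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:  # no existing property with this name
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


site = "<configuration></configuration>"
# Per-server value: on amb2 this would be the name the master should use.
print(set_property(site, "hbase.regionserver.hostname", "amb2.service.consul"))
```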

[1] http://pastebin.com/uZKqK9BJ
[2] http://pastebin.com/s10E2rtA


Re: Phantom region server and PENDING_OPEN regions

Posted by Samir Ahmic <ah...@gmail.com>.
Hi Kristoffer,
It looks like you have an issue with name resolution. Try removing the
incorrect value (node.dc1.consul) from resolv.conf and then restart the
HBase cluster.
Regarding the region in transition, check the master log for
"hbase:meta,,1.1588230740"; there should be an exception explaining why
hbase:meta cannot transition from PENDING_OPEN to OPEN. If the hbase:meta
table is unavailable, the master cannot finish initialization.
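A grep for the region name usually suffices, but the stack trace that explains the failure sits on the lines following the matching log record. A minimal Python sketch that keeps those continuation lines with each WARN/ERROR/FATAL record mentioning the region. It assumes the log4j layout seen in the region server logs in this thread (lines starting with "yyyy-MM-dd HH:mm:ss,SSS LEVEL"), and the function names are hypothetical. Searching for the encoded name 1588230740 also catches lines that do not print the full hbase:meta,,1.1588230740 name.

```python
import re
from typing import Iterable, Iterator, List

# Start of a log4j record as seen in this thread: "2015-11-24 08:26:45,099 WARN  [...]"
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)")


def split_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group raw lines into records; lines without a timestamp (stack frames,
    'Caused by:' lines) are attached to the preceding record."""
    record: List[str] = []
    for line in lines:
        if RECORD_START.match(line) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def find_region_problems(lines: Iterable[str], region: str) -> List[str]:
    """Return WARN/ERROR/FATAL records (with continuation lines) that mention `region`."""
    hits = []
    for record in split_records(lines):
        match = RECORD_START.match(record[0])
        text = "\n".join(record)
        if match and match.group(1) in ("WARN", "ERROR", "FATAL") and region in text:
            hits.append(text)
    return hits
```

Usage: with open(path) as f: find_region_problems(f.read().splitlines(), "1588230740"), where path points at the master's log file.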

Regards
Samir


Re: Phantom region server and PENDING_OPEN regions

Posted by Kristoffer Sjögren <st...@gmail.com>.
Sorry, I should mention that this is HBase 1.1.2.

ZooKeeper only reports one region server.

$ ls /hbase-unsecure/rs
[amb2.service.consul,16020,1448353564099]
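Comparing this list with what the master UI shows makes the discrepancy explicit. A minimal Python sketch with hypothetical helper names; it assumes the zkCli "ls" output format shown above, where children are separated by a comma plus a space. Server names themselves contain bare commas, so splitting on "," alone would break them apart.

```python
from typing import List


def parse_zk_ls(output: str) -> List[str]:
    """Parse zkCli 'ls' output such as "[a,1,2, b,3,4]" into child names."""
    inner = output.strip().strip("[]")
    # Children are separated by ", "; bare commas belong to the server names.
    return [child.strip() for child in inner.split(", ") if child.strip()]


def phantom_servers(zk_output: str, master_listed: List[str]) -> List[str]:
    """Names shown by the master that have no znode in the zkCli listing."""
    registered = set(parse_zk_ls(zk_output))
    return sorted(name for name in master_listed if name not in registered)


zk_output = "[amb2.service.consul,16020,1448353564099]"  # from: ls /hbase-unsecure/rs
master_ui = [
    "amb2.node.dc1.consul,16020,1448353564099",
    "amb2.service.consul,16020,1448353564099",
]
print(phantom_servers(zk_output, master_ui))
```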



