You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hbase.apache.org by lars hofhansl <lh...@yahoo.com> on 2012/11/04 01:37:54 UTC

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

CC'ing dev list...

Is anybody aware of any changes that went in recently that could cause this?
I looked around a bit, but could not find anything obvious.

-- Lars



________________________________
 From: Matt Corgan <mc...@hotpads.com>
To: user <us...@hbase.apache.org> 
Sent: Saturday, November 3, 2012 5:27 PM
Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
 
I think the cluster is ok without running hbck, as restarting the
regionserver process stops the warnings and everything looks ok otherwise.

here's the regionserver right after the split happens:
------------------------
2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Attempting to transition node
bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
*EGION_SPLITTING to RS_ZK_REGION_SPLIT*
2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Successfully transitioned node
bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
REGION_SPLITTING to RS_ZK_REGION_SPLIT
2012-11-01 22:45:28,730 DEBUG
org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the
master to process the split for bc62a8a72124a4ba3f6b9f302587903c
2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Attempting to transition node
bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
EGION_SPLIT to RS_ZK_REGION_SPLIT
2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Successfully transitioned node
bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
REGION_SPLIT to RS_ZK_REGION_SPLIT
-----------------------------

The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
continues for 15 or so hours and finally settles without manual
intervention with these regionserver log messages:
-----------------------
2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Attempting to transition node *
bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
RS_ZK_REGION_SPLIT

2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Successfully transitioned node *
bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
RS_ZK_REGION_SPLIT

2012-11-02 13:55:00,916 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
updated, and report to master.
Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
*bc62a8a72124a4ba3f6b9f302587903c*., new regions:
ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
Split took 15hrs, 9mins, 33sec

2012-11-02 13:55:00,945 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
updated, and report to master.
Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
new regions:
ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
----------------------

And the master finally logs this:
-----------------------
2012-11-02 13:55:00,783 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Region
bc62a8a72124a4ba3f6b9f302587903c not found on server
HadoopNode162.hotpads.srv,60020,1351788248279; failed processing

2012-11-02 13:55:00,783 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region
bc62a8a72124a4ba3f6b9f302587903c from server
HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore,
probably already processed its split
------------------------

I can't find any evidence that the region left the node during that time.
I'll have to catch it in action next time and see what the region is up to
during the problem period.

What does it mean that it successfully transitioned from SPLIT to SPLIT?
Is that a valid transition?

Matt



On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:

> Matt:
> From the following we can see that region bc62a8a72124a4ba3f6b9f302587903c
> cannot be found:
>
> 2012-11-02 00:00:02,909 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_SPLIT,
> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> region=bc62a8a72124a4ba3f6b9f302587903c
> 2012-11-02 00:00:02,909 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Region
> bc62a8a72124a4ba3f6b9f302587903c *not found on server
> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> processing
> 2012-11-02 00:00:02,909 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region
> bc62a8a72124a4ba3f6b9f302587903c from server
> HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore,
> probably already processed its split
>
> Have you run hbck to repair your cluster ?
>
> Thanks
>
> On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > Here's a sample of the master's logs from yesterday.  It's not correlated
> > exactly with the other pastebin log, but there's 3GB of this from
> > yesterday: http://pastebin.com/wP2rNN1t
> >
> > I'm am pushing the cluster a bit with importing data so testing the split
> > code harder than normal.  The regions are 500-1GB gzipped.  I can look
> into
> > it more but trying to figure out what to look for.
> >
> > Thanks Ted,
> > Matt
> >
> >
> > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Matt:
> > > This is the method which made the logging:
> > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName
> > > serverName,
> > >       final int znodeVersion)
> > >   throws KeeperException, IOException {
> > >     byte [] payload = Writables.getBytes(a, b);
> > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > >       znodeVersion, payload);
> > >   }
> > >
> > > transitionZKNode() calls tickleNodeSplit() when waiting for master to
> > split
> > > the region. Obviously something caused the master not able to split.
> > >
> > > How large is the region ?
> > >
> > > Can you pastebin master log for that period of time ?
> > >
> > > Thanks
> > >
> > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > >
> > > > We upgraded from .94.0 to .94.2 last week and have started to
> encounter
> > > > infinite loops of region-transition on splits.  I'm not sure yet if
> > it's
> > > > all splits nor if it's related to load.  Solution so far has been to
> > > > restart the regionserver process.
> > > >
> > > > log snippet:
> > > > http://pastebin.com/LpienZ7B
> > > >
> > > > It's repeating these two lines:
> > > > 2012-11-02 01:35:33,312 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832b76 Attempting to transition node
> > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > RS_ZK_REGION_SPLIT
> > > > 2012-11-02 01:35:33,364 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned node
> > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > RS_ZK_REGION_SPLIT
> > > >
> > > > with the occasional:
> > > > 2012-11-02 01:35:34,476 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting
> on
> > > the
> > > > master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> > > >
> > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?  It
> > looks
> > > > wrong, but I'm not familiar with the code at all.
> > > >
> > > > Thanks,
> > > > Matt
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by ramkrishna vasudevan <ra...@gmail.com>.

Seems strange.  I too will look into this.  Backtracking the region and its
parent will give an idea.

Regards
Ram

On Sun, Nov 4, 2012 at 7:57 AM, Matt Corgan <mc...@hotpads.com> wrote:

> Strangely, I don't see any record of that region in the master before what
> I already pasted even though I have logs back to 10/30.  Next time it
> happens I'll gather a full log record and try to debug while it's
> occurring.
>
>
> On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> chrajeshbabu32@gmail.com> wrote:
>
> > Hi Matt,
> > can you paste some more master logs of region
> > bc62a8a72124a4ba3f6b9f30258790 before split.
> > I think Its not problem with splitting.
> > We are getting
> >       LOG.warn("Region " + encodedName + " not found on server " +
> > serverName +
> >         "; failed processing");
> > this log means no entry in servers map(not fully assigned).
> >     Set<HRegionInfo> hris = this.servers.get(sn);
> >     HRegionInfo foundHri = null;
> >     for (HRegionInfo hri: hris) {
> >       if (hri.getEncodedName().equals(encodedName)) {
> >         foundHri = hri;
> >         break;
> >       }
> >     }
> >     return foundHri;
> >
> >
> >
> >
> > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
> >
> > > CC'ing dev list...
> > >
> > > Is anybody aware of any changes that went in recently that could cause
> > > this?
> > > I looked around a bit, but could not find anything obvious.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: Matt Corgan <mc...@hotpads.com>
> > > To: user <us...@hbase.apache.org>
> > > Sent: Saturday, November 3, 2012 5:27 PM
> > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >
> > > I think the cluster is ok without running hbck, as restarting the
> > > regionserver process stops the warnings and everything looks ok
> > otherwise.
> > >
> > > here's the regionserver right after the split happens:
> > > ------------------------
> > > 2012-11-01 22:45:28,726 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > 2012-11-01 22:45:28,730 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > 2012-11-01 22:45:28,730 DEBUG
> > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on
> > the
> > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> > > 2012-11-01 22:45:28,832 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > 2012-11-01 22:45:28,837 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > -----------------------------
> > >
> > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
> > > continues for 15 or so hours and finally settles without manual
> > > intervention with these regionserver log messages:
> > > -----------------------
> > > 2012-11-02 13:55:00,906 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Attempting to transition node *
> > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > > RS_ZK_REGION_SPLIT
> > >
> > > 2012-11-02 13:55:00,916 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Successfully transitioned node *
> > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > > RS_ZK_REGION_SPLIT
> > >
> > > 2012-11-02 13:55:00,916 INFO
> > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > > updated, and report to master.
> > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > Split took 15hrs, 9mins, 33sec
> > >
> > > 2012-11-02 13:55:00,945 INFO
> > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > > updated, and report to master.
> > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > new regions:
> > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> > > ----------------------
> > >
> > > And the master finally logs this:
> > > -----------------------
> > > 2012-11-02 13:55:00,783 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
> > >
> > > 2012-11-02 13:55:00,783 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > region
> > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > anymore,
> > > probably already processed its split
> > > ------------------------
> > >
> > > I can't find any evidence that the region left the node during that
> time.
> > > I'll have to catch it in action next time and see what the region is up
> > to
> > > during the problem period.
> > >
> > > What does it mean that it successfully transitioned from SPLIT to
> SPLIT?
> > > Is that a valid transition?
> > >
> > > Matt
> > >
> > >
> > >
> > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Matt:
> > > > From the following we can see that region
> > > bc62a8a72124a4ba3f6b9f302587903c
> > > > cannot be found:
> > > >
> > > > 2012-11-02 00:00:02,909 DEBUG
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > > > transition=RS_ZK_REGION_SPLIT,
> > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > > 2012-11-02 00:00:02,909 WARN
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > > processing
> > > > 2012-11-02 00:00:02,909 WARN
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > > region
> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > > anymore,
> > > > probably already processed its split
> > > >
> > > > Have you run hbck to repair your cluster ?
> > > >
> > > > Thanks
> > > >
> > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > >
> > > > > Here's a sample of the master's logs from yesterday.  It's not
> > > correlated
> > > > > exactly with the other pastebin log, but there's 3GB of this from
> > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > > >
> > > > > I'm am pushing the cluster a bit with importing data so testing the
> > > split
> > > > > code harder than normal.  The regions are 500-1GB gzipped.  I can
> > look
> > > > into
> > > > > it more but trying to figure out what to look for.
> > > > >
> > > > > Thanks Ted,
> > > > > Matt
> > > > >
> > > > >
> > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > > > Matt:
> > > > > > This is the method which made the logging:
> > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
> ServerName
> > > > > > serverName,
> > > > > >       final int znodeVersion)
> > > > > >   throws KeeperException, IOException {
> > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > > > > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > > > > >       znodeVersion, payload);
> > > > > >   }
> > > > > >
> > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
> master
> > to
> > > > > split
> > > > > > the region. Obviously something caused the master not able to
> > split.
> > > > > >
> > > > > > How large is the region ?
> > > > > >
> > > > > > Can you pastebin master log for that period of time ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <mcorgan@hotpads.com
> >
> > > > wrote:
> > > > > >
> > > > > > > We upgraded from .94.0 to .94.2 last week and have started to
> > > > encounter
> > > > > > > infinite loops of region-transition on splits.  I'm not sure
> yet
> > if
> > > > > it's
> > > > > > > all splits nor if it's related to load.  Solution so far has
> been
> > > to
> > > > > > > restart the regionserver process.
> > > > > > >
> > > > > > > log snippet:
> > > > > > > http://pastebin.com/LpienZ7B
> > > > > > >
> > > > > > > It's repeating these two lines:
> > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition
> > node
> > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > > RS_ZK_REGION_SPLIT
> > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned
> > node
> > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > > RS_ZK_REGION_SPLIT
> > > > > > >
> > > > > > > with the occasional:
> > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> > > waiting
> > > > on
> > > > > > the
> > > > > > > master to process the split for
> cf3e9bc069e1888983c06dc8e053ffcf
> > > > > > >
> > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?
> >  It
> > > > > looks
> > > > > > > wrong, but I'm not familiar with the code at all.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Matt
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

I dug back further to find the origin of e11e8b030897d6e5b973f8fe892e0eb2
to see if it had previous problems that left ZK in a bad state.  Here's the
regionserver and master logs: http://pastebin.com/qcvHjNCg from about 2
hours earlier.

* Nov 4, 22:34: region is created as daugher b of a split on node 159
* Nov 4, 22:35: moved to from node 159 to 169 by HBaseAdmin.move()
* Nov 5, 00:24: node 169 tries to split the region but gets Failed create
of ephemeral /hbase/unassigned/e11e8b030897d6e5b973f8fe892e0eb2

Is it possible that if something calls HBaseAdmin.move() on a daugher
region that is 30 seconds old, it could move the region but leave that ZK
node in a bad state?



On Tue, Nov 6, 2012 at 7:13 AM, lars hofhansl <lh...@yahoo.com> wrote:

> Thanks Ram!
>
>
>
>
>
> ----- Original Message -----
> From: ramkrishna vasudevan <ra...@gmail.com>
> To: dev@hbase.apache.org
> Cc:
> Sent: Tuesday, November 6, 2012 3:29 AM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> Raised HBASE-7103 for the same.
>
> Regards
> Ram
>
> On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
> > Thanks for the logs.
> > I found the reason.
> >
> > The following steps happen
> > -> Initially the parent region P1 starts splitting.
> > -> The split is going on normally.
> > -> Another split starts at the same time for the same region P1. (Not
> sure
> > why this started).
> > -> Rollback happens seeing an already existing node.
> > -> This node gets deleted in rollback and nodeDeleted Event starts.
> > -> In nodeDeleted event the RIT for the region P1 gets deleted.
> > -> Because of this there is no region in RIT.
> > -> Now the first split gets over.  Here the problem is we try to transit
> > the node to SPLITTING to SPLIT. But the node even does not exist.
> > But we don take any action on this.  We think it is successful.
> > -> Even before HBASE-6854 this could have happened.  Will file a JIRA for
> > the same.
> >
> > Regards
> > Ram
> >
> > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >
> >> Ram, here's the master log corresponding to
> http://pastebin.com/cSdMbA2a.
> >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
> >> regionserver in question (node 169), so i'm guessing that's
> >> 22f8fa73d8af837410ca270f344f6bb8's mommy.
> >>
> >> btw - you can see my balancer kick in 45 seconds later (runs every 10
> >> minutes) here, but so far i think that's coincidence:
> >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
> >> BalanceSwitch=false
> >>
> >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node
> 169
> >> and found all this stuff about a failed split:
> >> http://pastebin.com/xtXMZ388 and
> >> an attempted rollback.  Looks like it errors out when it goes to put a
> >> node
> >> in ZK but it's already there.  I'm not familiar with what a good split
> log
> >> looks like, so i'll stop commenting for now...
> >>
> >>
> >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
> >> ramkrishna.s.vasudevan@gmail.com> wrote:
> >>
> >> > The log shows that the first time the region was transitioned to
> >> SPLITTING
> >> > even then it was not populated in the Master's memory.
> >> >
> >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
> >> > ramkrishna.s.vasudevan@gmail.com> wrote:
> >> >
> >> > > Could you attach the master logs at this time
> >> > > 2012-11-05 00:24:55?
> >> > >
> >> > > Regards
> >> > > Ram
> >> > >
> >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lhofhansl@yahoo.com
> >> > >wrote:
> >> > >
> >> > >> Took a brief look through all SPLIT related commits since 0.94.0...
> >> > Found
> >> > >> these:
> >> > >>
> >> > >> HBASE-6854 *
> >> > >> HBASE-6713
> >> > >> HBASE-6329 *
> >> > >>
> >> > >> HBASE-6088
> >> > >>
> >> > >> HBASE-5986
> >> > >> HBASE-6070 *
> >> > >>
> >> > >>
> >> > >> The ones marked with * are (IMHO) more likely to be related.
> >> > >>
> >> > >> -- Lars
> >> > >>
> >> > >> ________________________________
> >> > >> From: Matt Corgan <mc...@hotpads.com>
> >> > >> To: dev <de...@hbase.apache.org>; lars hofhansl <lhofhansl@yahoo.com
> >
> >> > >> Sent: Monday, November 5, 2012 9:28 PM
> >> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> > >>
> >> > >> Yeah - we were running .94.0 since it came out but never saw it
> >> there.
> >> > >> I'll keep trying to narrow it down.  The only harm it's causing is
> >> log
> >> > >> spam and failing to move daughters to a new regionserver, which are
> >> > >> definitely problems, but it's not bringing down the cluster.
> >> > >>
> >> > >>
> >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lhofhansl@yahoo.com
> >
> >> > >> wrote:
> >> > >>
> >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you
> >> have
> >> > >> > never seen this before?
> >> > >> >
> >> > >> >
> >> > >> > -- Lars
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > ________________________________
> >> > >> >  From: Matt Corgan <mc...@hotpads.com>
> >> > >> > To: dev <de...@hbase.apache.org>
> >> > >> > Sent: Monday, November 5, 2012 9:10 PM
> >> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> > >> >
> >> > >> > It happened in this new table that I have all the logs for.  The
> >> > region
> >> > >> in
> >> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the
> >> first
> >> > >> trace
> >> > >> > of this region is on the regionserver with the "Intantiated
> >> > >> tableName..."
> >> > >> > message
> >> > >> >
> >> > >> > 2012-11-05 22:24:21,162 DEBUG
> >> > >> org.apache.hadoop.hbase.regionserver.HRegion:
> >> > >> > Instantiated
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> >> > >> > 9784c21cbc2.
> >> > >> >
> >> > >> > I also know this region came from a recent split, but I can't
> find
> >> any
> >> > >> log
> >> > >> > messages show the parent finishing the split that created this
> >> > daughter
> >> > >> > region.  So my guess now is that the split is actually finishing
> >> and
> >> > >> > letting clients continue to write data, but something is failing
> to
> >> > >> print
> >> > >> > the log line and correctly tell the master about the new region.
> >> > >> >
> >> > >> > I've noticed that these regions are showing up on the clients in
> >> calls
> >> > >> to
> >> > >> > HTable.getRegionLocations(), so the clients continue to function,
> >> but
> >> > >> if I
> >> > >> > call HBaseAdmin.move() i get an UnknownRegionException.
> >> > >> >
> >> > >> >
> >> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com>
> >> wrote:
> >> > >> >
> >> > >> > > I think what Matt encountered was this:
> >> > >> > > https://issues.apache.org/jira/browse/HBASE-7101
> >> > >> > >
> >> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <
> mcorgan@hotpads.com
> >> >
> >> > >> wrote:
> >> > >> > >
> >> > >> > > > Beats me - i can't find any record of it before that.
> >> > >> > > >
> >> > >> > > > I'm importing data into another table now.  I
> disabled/enabled
> >> the
> >> > >> > table
> >> > >> > > > first to make sure we have the original 4 region locations
> >> logged
> >> > >> > > > everywhere.  Will report back...
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> >> > >> > jdcryans@apache.org
> >> > >> > > > >wrote:
> >> > >> > > >
> >> > >> > > > > Wow, and can you figure how this happened?
> >> > >> > > > >
> >> > >> > > > > 2012-11-05 00:24:54,538 DEBUG
> >> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> >> > >> > > > >
> >> > >> > > > > Split? Else how did the master told that RS to open a
> region
> >> > that
> >> > >> > > doesn't
> >> > >> > > > > exist?
> >> > >> > > > >
> >> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
> >> > mcorgan@hotpads.com>
> >> > >> > > wrote:
> >> > >> > > > > > Here's one from last night with the master logs at the
> >> bottom.
> >> > >> > > > > > http://pastebin.com/cSdMbA2a
> >> > >> > > > > >
> >> > >> > > > > > I don't see that region in the master logs for 4 days
> back.
> >> >  I'm
> >> > >> > > going
> >> > >> > > > to
> >> > >> > > > > > import some new data from scratch soon and will be sure
> to
> >> > keep
> >> > >> all
> >> > >> > > the
> >> > >> > > > > > master logs.
> >> > >> > > > > >
> >> > >> > > > > >
> >> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> >> > >> > > > jdcryans@apache.org
> >> > >> > > > > >wrote:
> >> > >> > > > > >
> >> > >> > > > > >> This reminds me a lot of
> >> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
> >> > >> > > > > >>
> >> > >> > > > > >> What I'd like to see is the log from the first time the
> >> > master
> >> > >> > > > > >> receives the split message. I guess it says the region
> >> > doesn't
> >> > >> > exist
> >> > >> > > > > >> anymore because the split was processed already in the
> >> master
> >> > >> but
> >> > >> > > > > >> there's a failure mode similar to 4792.
> >> > >> > > > > >>
> >> > >> > > > > >> I saw this on another cluster last week but the logs
> were
> >> > >> rolling
> >> > >> > > > > >> based on size so the original split message was lost.
> >> > >> > > > > >>
> >> > >> > > > > >> J-D
> >> > >> > > > > >>
> >> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
> >> > >> mcorgan@hotpads.com>
> >> > >> > > > > wrote:
> >> > >> > > > > >> > I've had a few successful splits today without any
> >> errors.
> >> > >> I'll
> >> > >> > > > turn
> >> > >> > > > > up
> >> > >> > > > > >> > the importer speed tomorrow to start stressing the
> >> cluster
> >> > >> more.
> >> > >> > > > > >> >
> >> > >> > > > > >> >
> >> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
> >> > yuzhihong@gmail.com>
> >> > >> > > wrote:
> >> > >> > > > > >> >
> >> > >> > > > > >> >> Matt:
> >> > >> > > > > >> >> Was there any region server crash before you noticed
> >> the
> >> > >> > repeated
> >> > >> > > > log
> >> > >> > > > > >> for
> >> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> I wonder if the following JIRA might be related:
> >> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
> >> > missing
> >> > >> > > > daughter
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> Thanks
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> >> > >> > mcorgan@hotpads.com
> >> > >> > > >
> >> > >> > > > > >> wrote:
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> > Strangely, I don't see any record of that region in
> >> the
> >> > >> > master
> >> > >> > > > > before
> >> > >> > > > > >> >> what
> >> > >> > > > > >> >> > I already pasted even though I have logs back to
> >> 10/30.
> >> > >> Next
> >> > >> > > > time
> >> > >> > > > > it
> >> > >> > > > > >> >> > happens I'll gather a full log record and try to
> >> debug
> >> > >> while
> >> > >> > > it's
> >> > >> > > > > >> >> > occurring.
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
> >> > chintaguntla <
> >> > >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >> > > Hi Matt,
> >> > >> > > > > >> >> > > can you paste some more master logs of region
> >> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> >> > >> > > > > >> >> > > I think Its not problem with splitting.
> >> > >> > > > > >> >> > > We are getting
> >> > >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not
> >> found
> >> > on
> >> > >> > > server
> >> > >> > > > "
> >> > >> > > > > +
> >> > >> > > > > >> >> > > serverName +
> >> > >> > > > > >> >> > >         "; failed processing");
> >> > >> > > > > >> >> > > this log means no entry in servers map(not fully
> >> > >> assigned).
> >> > >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> >> > >> > > > > >> >> > >     HRegionInfo foundHri = null;
> >> > >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
> >> > >> > > > > >> >> > >       if
> >> (hri.getEncodedName().equals(encodedName)) {
> >> > >> > > > > >> >> > >         foundHri = hri;
> >> > >> > > > > >> >> > >         break;
> >> > >> > > > > >> >> > >       }
> >> > >> > > > > >> >> > >     }
> >> > >> > > > > >> >> > >     return foundHri;
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> >> > >> > > > > lhofhansl@yahoo.com>
> >> > >> > > > > >> >> > wrote:
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > > > CC'ing dev list...
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > Is anybody aware of any changes that went in
> >> > recently
> >> > >> > that
> >> > >> > > > > could
> >> > >> > > > > >> >> cause
> >> > >> > > > > >> >> > > > this?
> >> > >> > > > > >> >> > > > I looked around a bit, but could not find
> >> anything
> >> > >> > obvious.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > -- Lars
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > ________________________________
> >> > >> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> >> > >> > > > > >> >> > > > To: user <us...@hbase.apache.org>
> >> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> >> > >> > > > > >> >> > > > Subject: Re: infinite loop of
> RS_ZK_REGION_SPLIT
> >> on
> >> > >> .94.2
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > I think the cluster is ok without running hbck,
> >> as
> >> > >> > > restarting
> >> > >> > > > > the
> >> > >> > > > > >> >> > > > regionserver process stops the warnings and
> >> > everything
> >> > >> > > looks
> >> > >> > > > ok
> >> > >> > > > > >> >> > > otherwise.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > here's the regionserver right after the split
> >> > happens:
> >> > >> > > > > >> >> > > > ------------------------
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
> >> to
> >> > >> > > transition
> >> > >> > > > > node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> >> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953
> Successfully
> >> > >> > > > transitioned
> >> > >> > > > > >> node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > >> > > > > >> >> > > >
> >> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> > >> > > Still
> >> > >> > > > > >> waiting
> >> > >> > > > > >> >> on
> >> > >> > > > > >> >> > > the
> >> > >> > > > > >> >> > > > master to process the split for
> >> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
> >> to
> >> > >> > > transition
> >> > >> > > > > node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> >> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953
> Successfully
> >> > >> > > > transitioned
> >> > >> > > > > >> node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > -----------------------------
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT
> to
> >> > >> > > > > >> RS_ZK_REGION_SPLIT"
> >> > >> > > > > >> >> > > > continues for 15 or so hours and finally
> settles
> >> > >> without
> >> > >> > > > manual
> >> > >> > > > > >> >> > > > intervention with these regionserver log
> >> messages:
> >> > >> > > > > >> >> > > > -----------------------
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
> >> to
> >> > >> > > transition
> >> > >> > > > > >> node *
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> >> > >> RS_ZK_REGION_SPLIT
> >> > >> > > to
> >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953
> Successfully
> >> > >> > > > transitioned
> >> > >> > > > > >> node *
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> >> > >> RS_ZK_REGION_SPLIT
> >> > >> > > to
> >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> >> > >> > > > > >> >> > > >
> >> org.apache.hadoop.hbase.regionserver.SplitRequest:
> >> > >> Region
> >> > >> > > > > split,
> >> > >> > > > > >> META
> >> > >> > > > > >> >> > > > updated, and report to master.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> >> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new
> regions:
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> >> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> >> > >> > > > > >> >> > > >
> >> org.apache.hadoop.hbase.regionserver.SplitRequest:
> >> > >> Region
> >> > >> > > > > split,
> >> > >> > > > > >> META
> >> > >> > > > > >> >> > > > updated, and report to master.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> >> > >> > > > > >> >> > > > new regions:
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >>
> >> > >> > > >
> >> > >> >
> >> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
> >> > 17hrs,
> >> > >> > > 14mins,
> >> > >> > > > > 2sec
> >> > >> > > > > >> >> > > > ----------------------
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > And the master finally logs this:
> >> > >> > > > > >> >> > > > -----------------------
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > >> > > > > >> >> > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> Region
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on
> >> server
> >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
> >> > failed
> >> > >> > > > > processing
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > >> > > > > >> >> > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > Received
> >> > >> > > > > SPLIT
> >> > >> > > > > >> for
> >> > >> > > > > >> >> > > region
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279
> >> but it
> >> > >> > > doesn't
> >> > >> > > > > exist
> >> > >> > > > > >> >> > > anymore,
> >> > >> > > > > >> >> > > > probably already processed its split
> >> > >> > > > > >> >> > > > ------------------------
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > I can't find any evidence that the region left
> >> the
> >> > >> node
> >> > >> > > > during
> >> > >> > > > > >> that
> >> > >> > > > > >> >> > time.
> >> > >> > > > > >> >> > > > I'll have to catch it in action next time and
> see
> >> > what
> >> > >> > the
> >> > >> > > > > region
> >> > >> > > > > >> is
> >> > >> > > > > >> >> up
> >> > >> > > > > >> >> > > to
> >> > >> > > > > >> >> > > > during the problem period.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > What does it mean that it successfully
> >> transitioned
> >> > >> from
> >> > >> > > > SPLIT
> >> > >> > > > > to
> >> > >> > > > > >> >> > SPLIT?
> >> > >> > > > > >> >> > > > Is that a valid transition?
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > Matt
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> >> > >> > > yuzhihong@gmail.com>
> >> > >> > > > > >> wrote:
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > > Matt:
> >> > >> > > > > >> >> > > > > From the following we can see that region
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > >> > > > > >> >> > > > > cannot be found:
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> >> > >> > > > > >> >> > > > >
> >> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > > Handling
> >> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> >> > >> > > > > >> >> > > > >
> >> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> >> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > >> > > > > >> >> > > > >
> >> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > Region
> >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found
> on
> >> > >> server
> >> > >> > > > > >> >> > > > >
> >> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> >> > >> > > > > >> >> > > > > processing
> >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > >> > > > > >> >> > > > >
> >> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > > Received
> >> > >> > > > > SPLIT
> >> > >> > > > > >> >> for
> >> > >> > > > > >> >> > > > region
> >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279
> >> but
> >> > it
> >> > >> > > > doesn't
> >> > >> > > > > >> exist
> >> > >> > > > > >> >> > > > anymore,
> >> > >> > > > > >> >> > > > > probably already processed its split
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > Thanks
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> >> > >> > > > > >> mcorgan@hotpads.com>
> >> > >> > > > > >> >> > > wrote:
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from
> >> > >> yesterday.
> >> > >> > > >  It's
> >> > >> > > > > not
> >> > >> > > > > >> >> > > > correlated
> >> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but
> >> there's
> >> > >> 3GB
> >> > >> > of
> >> > >> > > > > this
> >> > >> > > > > >> from
> >> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with
> >> importing
> >> > >> data
> >> > >> > so
> >> > >> > > > > >> testing
> >> > >> > > > > >> >> the
> >> > >> > > > > >> >> > > > split
> >> > >> > > > > >> >> > > > > > code harder than normal.  The regions are
> >> > 500-1GB
> >> > >> > > > gzipped.
> >> > >> > > > >  I
> >> > >> > > > > >> can
> >> > >> > > > > >> >> > > look
> >> > >> > > > > >> >> > > > > into
> >> > >> > > > > >> >> > > > > > it more but trying to figure out what to
> look
> >> > for.
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > Thanks Ted,
> >> > >> > > > > >> >> > > > > > Matt
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> >> > >> > > > > yuzhihong@gmail.com>
> >> > >> > > > > >> >> > wrote:
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > > Matt:
> >> > >> > > > > >> >> > > > > > > This is the method which made the
> logging:
> >> > >> > > > > >> >> > > > > > >   private static int
> >> > >> > tickleNodeSplit(ZooKeeperWatcher
> >> > >> > > > > zkw,
> >> > >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> >> > >> > HRegionInfo
> >> > >> > > b,
> >> > >> > > > > >> >> > ServerName
> >> > >> > > > > >> >> > > > > > > serverName,
> >> > >> > > > > >> >> > > > > > >       final int znodeVersion)
> >> > >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
> >> > >> > > > > >> >> > > > > > >     byte [] payload =
> Writables.getBytes(a,
> >> > b);
> >> > >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
> >> > parent,
> >> > >> > > > > serverName,
> >> > >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> >> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> >> > >> > > > > >> >> > > > > > >       znodeVersion, payload);
> >> > >> > > > > >> >> > > > > > >   }
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > transitionZKNode() calls
> tickleNodeSplit()
> >> > when
> >> > >> > > waiting
> >> > >> > > > > for
> >> > >> > > > > >> >> > master
> >> > >> > > > > >> >> > > to
> >> > >> > > > > >> >> > > > > > split
> >> > >> > > > > >> >> > > > > > > the region. Obviously something caused
> the
> >> > >> master
> >> > >> > not
> >> > >> > > > > able
> >> > >> > > > > >> to
> >> > >> > > > > >> >> > > split.
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > How large is the region ?
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > Can you pastebin master log for that
> >> period of
> >> > >> > time ?
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > Thanks
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt
> >> Corgan <
> >> > >> > > > > >> >> mcorgan@hotpads.com
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > > > > wrote:
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last
> week
> >> > and
> >> > >> > have
> >> > >> > > > > >> started to
> >> > >> > > > > >> >> > > > > encounter
> >> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on
> >> > splits.
> >> > >> > I'm
> >> > >> > > > not
> >> > >> > > > > >> sure
> >> > >> > > > > >> >> > yet
> >> > >> > > > > >> >> > > if
> >> > >> > > > > >> >> > > > > > it's
> >> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
> >> > >> Solution
> >> > >> > > so
> >> > >> > > > > far
> >> > >> > > > > >> has
> >> > >> > > > > >> >> > been
> >> > >> > > > > >> >> > > > to
> >> > >> > > > > >> >> > > > > > > > restart the regionserver process.
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > log snippet:
> >> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > It's repeating these two lines:
> >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> >> > >> Attempting
> >> > >> > to
> >> > >> > > > > >> transition
> >> > >> > > > > >> >> > > node
> >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> >> > >> > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> to
> >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> >> > >> Successfully
> >> > >> > > > > >> >> transitioned
> >> > >> > > > > >> >> > > node
> >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> >> > >> > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> to
> >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > with the occasional:
> >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> > >> > > > > >> Still
> >> > >> > > > > >> >> > > > waiting
> >> > >> > > > > >> >> > > > > on
> >> > >> > > > > >> >> > > > > > > the
> >> > >> > > > > >> >> > > > > > > > master to process the split for
> >> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > Should the region transition from
> >> > >> > > RS_ZK_REGION_SPLIT
> >> > >> > > > to
> >> > >> > > > > >> >> itself?
> >> > >> > > > > >> >> > >  It
> >> > >> > > > > >> >> > > > > > looks
> >> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the
> >> code at
> >> > >> all.
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > Thanks,
> >> > >> > > > > >> >> > > > > > > > Matt
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by lars hofhansl <lh...@yahoo.com>.

Thanks Ram!





----- Original Message -----
From: ramkrishna vasudevan <ra...@gmail.com>
To: dev@hbase.apache.org
Cc: 
Sent: Tuesday, November 6, 2012 3:29 AM
Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Raised HBASE-7103 for the same.

Regards
Ram

On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> Thanks for the logs.
> I found the reason.
>
> The following steps happen
> -> Initially the parent region P1 starts splitting.
> -> The split is going on normally.
> -> Another split starts at the same time for the same region P1. (Not sure
> why this started).
> -> Rollback happens seeing an already existing node.
> -> This node gets deleted in rollback and nodeDeleted Event starts.
> -> In nodeDeleted event the RIT for the region P1 gets deleted.
> -> Because of this there is no region in RIT.
> -> Now the first split gets over.  Here the problem is we try to transit
> the node to SPLITTING to SPLIT. But the node even does not exist.
> But we don take any action on this.  We think it is successful.
> -> Even before HBASE-6854 this could have happened.  Will file a JIRA for
> the same.
>
> Regards
> Ram
>
> On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
>> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
>> regionserver in question (node 169), so i'm guessing that's
>> 22f8fa73d8af837410ca270f344f6bb8's mommy.
>>
>> btw - you can see my balancer kick in 45 seconds later (runs every 10
>> minutes) here, but so far i think that's coincidence:
>> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
>> BalanceSwitch=false
>>
>> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
>> and found all this stuff about a failed split:
>> http://pastebin.com/xtXMZ388 and
>> an attempted rollback.  Looks like it errors out when it goes to put a
>> node
>> in ZK but it's already there.  I'm not familiar with what a good split log
>> looks like, so i'll stop commenting for now...
>>
>>
>> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
>> ramkrishna.s.vasudevan@gmail.com> wrote:
>>
>> > The log shows that the first time the region was transitioned to
>> SPLITTING
>> > even then it was not populated in the Master's memory.
>> >
>> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
>> > ramkrishna.s.vasudevan@gmail.com> wrote:
>> >
>> > > Could you attach the master logs at this time
>> > > 2012-11-05 00:24:55?
>> > >
>> > > Regards
>> > > Ram
>> > >
>> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lhofhansl@yahoo.com
>> > >wrote:
>> > >
>> > >> Took a brief look through all SPLIT related commits since 0.94.0...
>> > Found
>> > >> these:
>> > >>
>> > >> HBASE-6854 *
>> > >> HBASE-6713
>> > >> HBASE-6329 *
>> > >>
>> > >> HBASE-6088
>> > >>
>> > >> HBASE-5986
>> > >> HBASE-6070 *
>> > >>
>> > >>
>> > >> The ones marked with * are (IMHO) more likely to be related.
>> > >>
>> > >> -- Lars
>> > >>
>> > >> ________________________________
>> > >> From: Matt Corgan <mc...@hotpads.com>
>> > >> To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
>> > >> Sent: Monday, November 5, 2012 9:28 PM
>> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > >>
>> > >> Yeah - we were running .94.0 since it came out but never saw it
>> there.
>> > >> I'll keep trying to narrow it down.  The only harm it's causing is
>> log
>> > >> spam and failing to move daughters to a new regionserver, which are
>> > >> definitely problems, but it's not bringing down the cluster.
>> > >>
>> > >>
>> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com>
>> > >> wrote:
>> > >>
>> > >> > So it seems you can repeat this to some extend in 0.94.2, but you
>> have
>> > >> > never seen this before?
>> > >> >
>> > >> >
>> > >> > -- Lars
>> > >> >
>> > >> >
>> > >> >
>> > >> > ________________________________
>> > >> >  From: Matt Corgan <mc...@hotpads.com>
>> > >> > To: dev <de...@hbase.apache.org>
>> > >> > Sent: Monday, November 5, 2012 9:10 PM
>> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > >> >
>> > >> > It happened in this new table that I have all the logs for.  The
>> > region
>> > >> in
>> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the
>> first
>> > >> trace
>> > >> > of this region is on the regionserver with the "Intantiated
>> > >> tableName..."
>> > >> > message
>> > >> >
>> > >> > 2012-11-05 22:24:21,162 DEBUG
>> > >> org.apache.hadoop.hbase.regionserver.HRegion:
>> > >> > Instantiated
>> > >> >
>> > >> >
>> > >>
>> >
>> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
>> > >> > 9784c21cbc2.
>> > >> >
>> > >> > I also know this region came from a recent split, but I can't find
>> any
>> > >> log
>> > >> > messages show the parent finishing the split that created this
>> > daughter
>> > >> > region.  So my guess now is that the split is actually finishing
>> and
>> > >> > letting clients continue to write data, but something is failing to
>> > >> print
>> > >> > the log line and correctly tell the master about the new region.
>> > >> >
>> > >> > I've noticed that these regions are showing up on the clients in
>> calls
>> > >> to
>> > >> > HTable.getRegionLocations(), so the clients continue to function,
>> but
>> > >> if I
>> > >> > call HBaseAdmin.move() i get an UnknownRegionException.
>> > >> >
>> > >> >
>> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com>
>> wrote:
>> > >> >
>> > >> > > I think what Matt encountered was this:
>> > >> > > https://issues.apache.org/jira/browse/HBASE-7101
>> > >> > >
>> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mcorgan@hotpads.com
>> >
>> > >> wrote:
>> > >> > >
>> > >> > > > Beats me - i can't find any record of it before that.
>> > >> > > >
>> > >> > > > I'm importing data into another table now.  I disabled/enabled
>> the
>> > >> > table
>> > >> > > > first to make sure we have the original 4 region locations
>> logged
>> > >> > > > everywhere.  Will report back...
>> > >> > > >
>> > >> > > >
>> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
>> > >> > jdcryans@apache.org
>> > >> > > > >wrote:
>> > >> > > >
>> > >> > > > > Wow, and can you figure how this happened?
>> > >> > > > >
>> > >> > > > > 2012-11-05 00:24:54,538 DEBUG
>> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
>> > >> > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
>> > >> > > > >
>> > >> > > > > Split? Else how did the master told that RS to open a region
>> > that
>> > >> > > doesn't
>> > >> > > > > exist?
>> > >> > > > >
>> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
>> > mcorgan@hotpads.com>
>> > >> > > wrote:
>> > >> > > > > > Here's one from last night with the master logs at the
>> bottom.
>> > >> > > > > > http://pastebin.com/cSdMbA2a
>> > >> > > > > >
>> > >> > > > > > I don't see that region in the master logs for 4 days back.
>> >  I'm
>> > >> > > going
>> > >> > > > to
>> > >> > > > > > import some new data from scratch soon and will be sure to
>> > keep
>> > >> all
>> > >> > > the
>> > >> > > > > > master logs.
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
>> > >> > > > jdcryans@apache.org
>> > >> > > > > >wrote:
>> > >> > > > > >
>> > >> > > > > >> This reminds me a lot of
>> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
>> > >> > > > > >>
>> > >> > > > > >> What I'd like to see is the log from the first time the
>> > master
>> > >> > > > > >> receives the split message. I guess it says the region
>> > doesn't
>> > >> > exist
>> > >> > > > > >> anymore because the split was processed already in the
>> master
>> > >> but
>> > >> > > > > >> there's a failure mode similar to 4792.
>> > >> > > > > >>
>> > >> > > > > >> I saw this on another cluster last week but the logs were
>> > >> rolling
>> > >> > > > > >> based on size so the original split message was lost.
>> > >> > > > > >>
>> > >> > > > > >> J-D
>> > >> > > > > >>
>> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
>> > >> mcorgan@hotpads.com>
>> > >> > > > > wrote:
>> > >> > > > > >> > I've had a few successful splits today without any
>> errors.
>> > >> I'll
>> > >> > > > turn
>> > >> > > > > up
>> > >> > > > > >> > the importer speed tomorrow to start stressing the
>> cluster
>> > >> more.
>> > >> > > > > >> >
>> > >> > > > > >> >
>> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
>> > yuzhihong@gmail.com>
>> > >> > > wrote:
>> > >> > > > > >> >
>> > >> > > > > >> >> Matt:
>> > >> > > > > >> >> Was there any region server crash before you noticed
>> the
>> > >> > repeated
>> > >> > > > log
>> > >> > > > > >> for
>> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
>> > >> > > > > >> >>
>> > >> > > > > >> >> I wonder if the following JIRA might be related:
>> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
>> > missing
>> > >> > > > daughter
>> > >> > > > > >> >>
>> > >> > > > > >> >> Thanks
>> > >> > > > > >> >>
>> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
>> > >> > mcorgan@hotpads.com
>> > >> > > >
>> > >> > > > > >> wrote:
>> > >> > > > > >> >>
>> > >> > > > > >> >> > Strangely, I don't see any record of that region in
>> the
>> > >> > master
>> > >> > > > > before
>> > >> > > > > >> >> what
>> > >> > > > > >> >> > I already pasted even though I have logs back to
>> 10/30.
>> > >> Next
>> > >> > > > time
>> > >> > > > > it
>> > >> > > > > >> >> > happens I'll gather a full log record and try to
>> debug
>> > >> while
>> > >> > > it's
>> > >> > > > > >> >> > occurring.
>> > >> > > > > >> >> >
>> > >> > > > > >> >> >
>> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
>> > chintaguntla <
>> > >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
>> > >> > > > > >> >> >
>> > >> > > > > >> >> > > Hi Matt,
>> > >> > > > > >> >> > > can you paste some more master logs of region
>> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
>> > >> > > > > >> >> > > I think Its not problem with splitting.
>> > >> > > > > >> >> > > We are getting
>> > >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not
>> found
>> > on
>> > >> > > server
>> > >> > > > "
>> > >> > > > > +
>> > >> > > > > >> >> > > serverName +
>> > >> > > > > >> >> > >         "; failed processing");
>> > >> > > > > >> >> > > this log means no entry in servers map(not fully
>> > >> assigned).
>> > >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
>> > >> > > > > >> >> > >     HRegionInfo foundHri = null;
>> > >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
>> > >> > > > > >> >> > >       if
>> (hri.getEncodedName().equals(encodedName)) {
>> > >> > > > > >> >> > >         foundHri = hri;
>> > >> > > > > >> >> > >         break;
>> > >> > > > > >> >> > >       }
>> > >> > > > > >> >> > >     }
>> > >> > > > > >> >> > >     return foundHri;
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
>> > >> > > > > lhofhansl@yahoo.com>
>> > >> > > > > >> >> > wrote:
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > > > CC'ing dev list...
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > Is anybody aware of any changes that went in
>> > recently
>> > >> > that
>> > >> > > > > could
>> > >> > > > > >> >> cause
>> > >> > > > > >> >> > > > this?
>> > >> > > > > >> >> > > > I looked around a bit, but could not find
>> anything
>> > >> > obvious.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > -- Lars
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > ________________________________
>> > >> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
>> > >> > > > > >> >> > > > To: user <us...@hbase.apache.org>
>> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
>> > >> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT
>> on
>> > >> .94.2
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > I think the cluster is ok without running hbck,
>> as
>> > >> > > restarting
>> > >> > > > > the
>> > >> > > > > >> >> > > > regionserver process stops the warnings and
>> > everything
>> > >> > > looks
>> > >> > > > ok
>> > >> > > > > >> >> > > otherwise.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > here's the regionserver right after the split
>> > happens:
>> > >> > > > > >> >> > > > ------------------------
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
>> to
>> > >> > > transition
>> > >> > > > > node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
>> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > >> > > > transitioned
>> > >> > > > > >> node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> > >> > > > > >> >> > > >
>> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> > >> > > Still
>> > >> > > > > >> waiting
>> > >> > > > > >> >> on
>> > >> > > > > >> >> > > the
>> > >> > > > > >> >> > > > master to process the split for
>> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
>> to
>> > >> > > transition
>> > >> > > > > node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
>> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > >> > > > transitioned
>> > >> > > > > >> node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > -----------------------------
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
>> > >> > > > > >> RS_ZK_REGION_SPLIT"
>> > >> > > > > >> >> > > > continues for 15 or so hours and finally settles
>> > >> without
>> > >> > > > manual
>> > >> > > > > >> >> > > > intervention with these regionserver log
>> messages:
>> > >> > > > > >> >> > > > -----------------------
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
>> to
>> > >> > > transition
>> > >> > > > > >> node *
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
>> > >> RS_ZK_REGION_SPLIT
>> > >> > > to
>> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > >> > > > transitioned
>> > >> > > > > >> node *
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
>> > >> RS_ZK_REGION_SPLIT
>> > >> > > to
>> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
>> > >> > > > > >> >> > > >
>> org.apache.hadoop.hbase.regionserver.SplitRequest:
>> > >> Region
>> > >> > > > > split,
>> > >> > > > > >> META
>> > >> > > > > >> >> > > > updated, and report to master.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
>> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
>> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
>> > >> > > > > >> >> > > >
>> org.apache.hadoop.hbase.regionserver.SplitRequest:
>> > >> Region
>> > >> > > > > split,
>> > >> > > > > >> META
>> > >> > > > > >> >> > > > updated, and report to master.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
>> > >> > > > > >> >> > > > new regions:
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> >
>> > >> > > > > >>
>> > >> > > >
>> > >> >
>> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
>> > 17hrs,
>> > >> > > 14mins,
>> > >> > > > > 2sec
>> > >> > > > > >> >> > > > ----------------------
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > And the master finally logs this:
>> > >> > > > > >> >> > > > -----------------------
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
>> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> Region
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on
>> server
>> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
>> > failed
>> > >> > > > > processing
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
>> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > Received
>> > >> > > > > SPLIT
>> > >> > > > > >> for
>> > >> > > > > >> >> > > region
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279
>> but it
>> > >> > > doesn't
>> > >> > > > > exist
>> > >> > > > > >> >> > > anymore,
>> > >> > > > > >> >> > > > probably already processed its split
>> > >> > > > > >> >> > > > ------------------------
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > I can't find any evidence that the region left
>> the
>> > >> node
>> > >> > > > during
>> > >> > > > > >> that
>> > >> > > > > >> >> > time.
>> > >> > > > > >> >> > > > I'll have to catch it in action next time and see
>> > what
>> > >> > the
>> > >> > > > > region
>> > >> > > > > >> is
>> > >> > > > > >> >> up
>> > >> > > > > >> >> > > to
>> > >> > > > > >> >> > > > during the problem period.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > What does it mean that it successfully
>> transitioned
>> > >> from
>> > >> > > > SPLIT
>> > >> > > > > to
>> > >> > > > > >> >> > SPLIT?
>> > >> > > > > >> >> > > > Is that a valid transition?
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > Matt
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
>> > >> > > yuzhihong@gmail.com>
>> > >> > > > > >> wrote:
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > > Matt:
>> > >> > > > > >> >> > > > > From the following we can see that region
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
>> > >> > > > > >> >> > > > > cannot be found:
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
>> > >> > > > > >> >> > > > >
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > > Handling
>> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
>> > >> > > > > >> >> > > > >
>> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
>> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
>> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
>> > >> > > > > >> >> > > > >
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > Region
>> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
>> > >> server
>> > >> > > > > >> >> > > > >
>> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
>> > >> > > > > >> >> > > > > processing
>> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
>> > >> > > > > >> >> > > > >
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > > Received
>> > >> > > > > SPLIT
>> > >> > > > > >> >> for
>> > >> > > > > >> >> > > > region
>> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279
>> but
>> > it
>> > >> > > > doesn't
>> > >> > > > > >> exist
>> > >> > > > > >> >> > > > anymore,
>> > >> > > > > >> >> > > > > probably already processed its split
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > Thanks
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
>> > >> > > > > >> mcorgan@hotpads.com>
>> > >> > > > > >> >> > > wrote:
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from
>> > >> yesterday.
>> > >> > > >  It's
>> > >> > > > > not
>> > >> > > > > >> >> > > > correlated
>> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but
>> there's
>> > >> 3GB
>> > >> > of
>> > >> > > > > this
>> > >> > > > > >> from
>> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with
>> importing
>> > >> data
>> > >> > so
>> > >> > > > > >> testing
>> > >> > > > > >> >> the
>> > >> > > > > >> >> > > > split
>> > >> > > > > >> >> > > > > > code harder than normal.  The regions are
>> > 500-1GB
>> > >> > > > gzipped.
>> > >> > > > >  I
>> > >> > > > > >> can
>> > >> > > > > >> >> > > look
>> > >> > > > > >> >> > > > > into
>> > >> > > > > >> >> > > > > > it more but trying to figure out what to look
>> > for.
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > Thanks Ted,
>> > >> > > > > >> >> > > > > > Matt
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
>> > >> > > > > yuzhihong@gmail.com>
>> > >> > > > > >> >> > wrote:
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > > Matt:
>> > >> > > > > >> >> > > > > > > This is the method which made the logging:
>> > >> > > > > >> >> > > > > > >   private static int
>> > >> > tickleNodeSplit(ZooKeeperWatcher
>> > >> > > > > zkw,
>> > >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
>> > >> > HRegionInfo
>> > >> > > b,
>> > >> > > > > >> >> > ServerName
>> > >> > > > > >> >> > > > > > > serverName,
>> > >> > > > > >> >> > > > > > >       final int znodeVersion)
>> > >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
>> > >> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a,
>> > b);
>> > >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
>> > parent,
>> > >> > > > > serverName,
>> > >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
>> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
>> > >> > > > > >> >> > > > > > >       znodeVersion, payload);
>> > >> > > > > >> >> > > > > > >   }
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit()
>> > when
>> > >> > > waiting
>> > >> > > > > for
>> > >> > > > > >> >> > master
>> > >> > > > > >> >> > > to
>> > >> > > > > >> >> > > > > > split
>> > >> > > > > >> >> > > > > > > the region. Obviously something caused the
>> > >> master
>> > >> > not
>> > >> > > > > able
>> > >> > > > > >> to
>> > >> > > > > >> >> > > split.
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > How large is the region ?
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > Can you pastebin master log for that
>> period of
>> > >> > time ?
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > Thanks
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt
>> Corgan <
>> > >> > > > > >> >> mcorgan@hotpads.com
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > > > > wrote:
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week
>> > and
>> > >> > have
>> > >> > > > > >> started to
>> > >> > > > > >> >> > > > > encounter
>> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on
>> > splits.
>> > >> > I'm
>> > >> > > > not
>> > >> > > > > >> sure
>> > >> > > > > >> >> > yet
>> > >> > > > > >> >> > > if
>> > >> > > > > >> >> > > > > > it's
>> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
>> > >> Solution
>> > >> > > so
>> > >> > > > > far
>> > >> > > > > >> has
>> > >> > > > > >> >> > been
>> > >> > > > > >> >> > > > to
>> > >> > > > > >> >> > > > > > > > restart the regionserver process.
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > log snippet:
>> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > It's repeating these two lines:
>> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
>> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
>> > >> Attempting
>> > >> > to
>> > >> > > > > >> transition
>> > >> > > > > >> >> > > node
>> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
>> > >> > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> to
>> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
>> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
>> > >> Successfully
>> > >> > > > > >> >> transitioned
>> > >> > > > > >> >> > > node
>> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
>> > >> > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> to
>> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > with the occasional:
>> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> > >> > > > > >> Still
>> > >> > > > > >> >> > > > waiting
>> > >> > > > > >> >> > > > > on
>> > >> > > > > >> >> > > > > > > the
>> > >> > > > > >> >> > > > > > > > master to process the split for
>> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > Should the region transition from
>> > >> > > RS_ZK_REGION_SPLIT
>> > >> > > > to
>> > >> > > > > >> >> itself?
>> > >> > > > > >> >> > >  It
>> > >> > > > > >> >> > > > > > looks
>> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the
>> code at
>> > >> all.
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > Thanks,
>> > >> > > > > >> >> > > > > > > > Matt
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by ramkrishna vasudevan <ra...@gmail.com>.

Raised HBASE-7103 for the same.

Regards
Ram

On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> Thanks for the logs.
> I found the reason.
>
> The following steps happen
> -> Initially the parent region P1 starts splitting.
> -> The split is going on normally.
> -> Another split starts at the same time for the same region P1. (Not sure
> why this started).
> -> Rollback happens seeing an already existing node.
> -> This node gets deleted in rollback and nodeDeleted Event starts.
> -> In nodeDeleted event the RIT for the region P1 gets deleted.
> -> Because of this there is no region in RIT.
> -> Now the first split gets over.  Here the problem is we try to transit
> the node to SPLITTING to SPLIT. But the node even does not exist.
> But we don take any action on this.  We think it is successful.
> -> Even before HBASE-6854 this could have happened.  Will file a JIRA for
> the same.
>
> Regards
> Ram
>
> On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
>> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
>> regionserver in question (node 169), so i'm guessing that's
>> 22f8fa73d8af837410ca270f344f6bb8's mommy.
>>
>> btw - you can see my balancer kick in 45 seconds later (runs every 10
>> minutes) here, but so far i think that's coincidence:
>> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
>> BalanceSwitch=false
>>
>> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
>> and found all this stuff about a failed split:
>> http://pastebin.com/xtXMZ388 and
>> an attempted rollback.  Looks like it errors out when it goes to put a
>> node
>> in ZK but it's already there.  I'm not familiar with what a good split log
>> looks like, so i'll stop commenting for now...
>>
>>
>> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
>> ramkrishna.s.vasudevan@gmail.com> wrote:
>>
>> > The log shows that the first time the region was transitioned to
>> SPLITTING
>> > even then it was not populated in the Master's memory.
>> >
>> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
>> > ramkrishna.s.vasudevan@gmail.com> wrote:
>> >
>> > > Could you attach the master logs at this time
>> > > 2012-11-05 00:24:55?
>> > >
>> > > Regards
>> > > Ram
>> > >
>> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lhofhansl@yahoo.com
>> > >wrote:
>> > >
>> > >> Took a brief look through all SPLIT related commits since 0.94.0...
>> > Found
>> > >> these:
>> > >>
>> > >> HBASE-6854 *
>> > >> HBASE-6713
>> > >> HBASE-6329 *
>> > >>
>> > >> HBASE-6088
>> > >>
>> > >> HBASE-5986
>> > >> HBASE-6070 *
>> > >>
>> > >>
>> > >> The ones marked with * are (IMHO) more likely to be related.
>> > >>
>> > >> -- Lars
>> > >>
>> > >> ________________________________
>> > >> From: Matt Corgan <mc...@hotpads.com>
>> > >> To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
>> > >> Sent: Monday, November 5, 2012 9:28 PM
>> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > >>
>> > >> Yeah - we were running .94.0 since it came out but never saw it
>> there.
>> > >> I'll keep trying to narrow it down.  The only harm it's causing is
>> log
>> > >> spam and failing to move daughters to a new regionserver, which are
>> > >> definitely problems, but it's not bringing down the cluster.
>> > >>
>> > >>
>> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com>
>> > >> wrote:
>> > >>
>> > >> > So it seems you can repeat this to some extend in 0.94.2, but you
>> have
>> > >> > never seen this before?
>> > >> >
>> > >> >
>> > >> > -- Lars
>> > >> >
>> > >> >
>> > >> >
>> > >> > ________________________________
>> > >> >  From: Matt Corgan <mc...@hotpads.com>
>> > >> > To: dev <de...@hbase.apache.org>
>> > >> > Sent: Monday, November 5, 2012 9:10 PM
>> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > >> >
>> > >> > It happened in this new table that I have all the logs for.  The
>> > region
>> > >> in
>> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the
>> first
>> > >> trace
>> > >> > of this region is on the regionserver with the "Intantiated
>> > >> tableName..."
>> > >> > message
>> > >> >
>> > >> > 2012-11-05 22:24:21,162 DEBUG
>> > >> org.apache.hadoop.hbase.regionserver.HRegion:
>> > >> > Instantiated
>> > >> >
>> > >> >
>> > >>
>> >
>> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
>> > >> > 9784c21cbc2.
>> > >> >
>> > >> > I also know this region came from a recent split, but I can't find
>> any
>> > >> log
>> > >> > messages show the parent finishing the split that created this
>> > daughter
>> > >> > region.  So my guess now is that the split is actually finishing
>> and
>> > >> > letting clients continue to write data, but something is failing to
>> > >> print
>> > >> > the log line and correctly tell the master about the new region.
>> > >> >
>> > >> > I've noticed that these regions are showing up on the clients in
>> calls
>> > >> to
>> > >> > HTable.getRegionLocations(), so the clients continue to function,
>> but
>> > >> if I
>> > >> > call HBaseAdmin.move() i get an UnknownRegionException.
>> > >> >
>> > >> >
>> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com>
>> wrote:
>> > >> >
>> > >> > > I think what Matt encountered was this:
>> > >> > > https://issues.apache.org/jira/browse/HBASE-7101
>> > >> > >
>> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mcorgan@hotpads.com
>> >
>> > >> wrote:
>> > >> > >
>> > >> > > > Beats me - i can't find any record of it before that.
>> > >> > > >
>> > >> > > > I'm importing data into another table now.  I disabled/enabled
>> the
>> > >> > table
>> > >> > > > first to make sure we have the original 4 region locations
>> logged
>> > >> > > > everywhere.  Will report back...
>> > >> > > >
>> > >> > > >
>> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
>> > >> > jdcryans@apache.org
>> > >> > > > >wrote:
>> > >> > > >
>> > >> > > > > Wow, and can you figure how this happened?
>> > >> > > > >
>> > >> > > > > 2012-11-05 00:24:54,538 DEBUG
>> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
>> > >> > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
>> > >> > > > >
>> > >> > > > > Split? Else how did the master told that RS to open a region
>> > that
>> > >> > > doesn't
>> > >> > > > > exist?
>> > >> > > > >
>> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
>> > mcorgan@hotpads.com>
>> > >> > > wrote:
>> > >> > > > > > Here's one from last night with the master logs at the
>> bottom.
>> > >> > > > > > http://pastebin.com/cSdMbA2a
>> > >> > > > > >
>> > >> > > > > > I don't see that region in the master logs for 4 days back.
>> >  I'm
>> > >> > > going
>> > >> > > > to
>> > >> > > > > > import some new data from scratch soon and will be sure to
>> > keep
>> > >> all
>> > >> > > the
>> > >> > > > > > master logs.
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
>> > >> > > > jdcryans@apache.org
>> > >> > > > > >wrote:
>> > >> > > > > >
>> > >> > > > > >> This reminds me a lot of
>> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
>> > >> > > > > >>
>> > >> > > > > >> What I'd like to see is the log from the first time the
>> > master
>> > >> > > > > >> receives the split message. I guess it says the region
>> > doesn't
>> > >> > exist
>> > >> > > > > >> anymore because the split was processed already in the
>> master
>> > >> but
>> > >> > > > > >> there's a failure mode similar to 4792.
>> > >> > > > > >>
>> > >> > > > > >> I saw this on another cluster last week but the logs were
>> > >> rolling
>> > >> > > > > >> based on size so the original split message was lost.
>> > >> > > > > >>
>> > >> > > > > >> J-D
>> > >> > > > > >>
>> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
>> > >> mcorgan@hotpads.com>
>> > >> > > > > wrote:
>> > >> > > > > >> > I've had a few successful splits today without any
>> errors.
>> > >> I'll
>> > >> > > > turn
>> > >> > > > > up
>> > >> > > > > >> > the importer speed tomorrow to start stressing the
>> cluster
>> > >> more.
>> > >> > > > > >> >
>> > >> > > > > >> >
>> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
>> > yuzhihong@gmail.com>
>> > >> > > wrote:
>> > >> > > > > >> >
>> > >> > > > > >> >> Matt:
>> > >> > > > > >> >> Was there any region server crash before you noticed
>> the
>> > >> > repeated
>> > >> > > > log
>> > >> > > > > >> for
>> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
>> > >> > > > > >> >>
>> > >> > > > > >> >> I wonder if the following JIRA might be related:
>> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
>> > missing
>> > >> > > > daughter
>> > >> > > > > >> >>
>> > >> > > > > >> >> Thanks
>> > >> > > > > >> >>
>> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
>> > >> > mcorgan@hotpads.com
>> > >> > > >
>> > >> > > > > >> wrote:
>> > >> > > > > >> >>
>> > >> > > > > >> >> > Strangely, I don't see any record of that region in
>> the
>> > >> > master
>> > >> > > > > before
>> > >> > > > > >> >> what
>> > >> > > > > >> >> > I already pasted even though I have logs back to
>> 10/30.
>> > >> Next
>> > >> > > > time
>> > >> > > > > it
>> > >> > > > > >> >> > happens I'll gather a full log record and try to
>> debug
>> > >> while
>> > >> > > it's
>> > >> > > > > >> >> > occurring.
>> > >> > > > > >> >> >
>> > >> > > > > >> >> >
>> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
>> > chintaguntla <
>> > >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
>> > >> > > > > >> >> >
>> > >> > > > > >> >> > > Hi Matt,
>> > >> > > > > >> >> > > can you paste some more master logs of region
>> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
>> > >> > > > > >> >> > > I think Its not problem with splitting.
>> > >> > > > > >> >> > > We are getting
>> > >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not
>> found
>> > on
>> > >> > > server
>> > >> > > > "
>> > >> > > > > +
>> > >> > > > > >> >> > > serverName +
>> > >> > > > > >> >> > >         "; failed processing");
>> > >> > > > > >> >> > > this log means no entry in servers map(not fully
>> > >> assigned).
>> > >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
>> > >> > > > > >> >> > >     HRegionInfo foundHri = null;
>> > >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
>> > >> > > > > >> >> > >       if
>> (hri.getEncodedName().equals(encodedName)) {
>> > >> > > > > >> >> > >         foundHri = hri;
>> > >> > > > > >> >> > >         break;
>> > >> > > > > >> >> > >       }
>> > >> > > > > >> >> > >     }
>> > >> > > > > >> >> > >     return foundHri;
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
>> > >> > > > > lhofhansl@yahoo.com>
>> > >> > > > > >> >> > wrote:
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > > > CC'ing dev list...
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > Is anybody aware of any changes that went in
>> > recently
>> > >> > that
>> > >> > > > > could
>> > >> > > > > >> >> cause
>> > >> > > > > >> >> > > > this?
>> > >> > > > > >> >> > > > I looked around a bit, but could not find
>> anything
>> > >> > obvious.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > -- Lars
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > ________________________________
>> > >> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
>> > >> > > > > >> >> > > > To: user <us...@hbase.apache.org>
>> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
>> > >> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT
>> on
>> > >> .94.2
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > I think the cluster is ok without running hbck,
>> as
>> > >> > > restarting
>> > >> > > > > the
>> > >> > > > > >> >> > > > regionserver process stops the warnings and
>> > everything
>> > >> > > looks
>> > >> > > > ok
>> > >> > > > > >> >> > > otherwise.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > here's the regionserver right after the split
>> > happens:
>> > >> > > > > >> >> > > > ------------------------
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
>> to
>> > >> > > transition
>> > >> > > > > node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
>> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > >> > > > transitioned
>> > >> > > > > >> node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> > >> > > > > >> >> > > >
>> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> > >> > > Still
>> > >> > > > > >> waiting
>> > >> > > > > >> >> on
>> > >> > > > > >> >> > > the
>> > >> > > > > >> >> > > > master to process the split for
>> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
>> to
>> > >> > > transition
>> > >> > > > > node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
>> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > >> > > > transitioned
>> > >> > > > > >> node
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > -----------------------------
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
>> > >> > > > > >> RS_ZK_REGION_SPLIT"
>> > >> > > > > >> >> > > > continues for 15 or so hours and finally settles
>> > >> without
>> > >> > > > manual
>> > >> > > > > >> >> > > > intervention with these regionserver log
>> messages:
>> > >> > > > > >> >> > > > -----------------------
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
>> to
>> > >> > > transition
>> > >> > > > > >> node *
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
>> > >> RS_ZK_REGION_SPLIT
>> > >> > > to
>> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
>> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > >> > > > transitioned
>> > >> > > > > >> node *
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
>> > >> RS_ZK_REGION_SPLIT
>> > >> > > to
>> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
>> > >> > > > > >> >> > > >
>> org.apache.hadoop.hbase.regionserver.SplitRequest:
>> > >> Region
>> > >> > > > > split,
>> > >> > > > > >> META
>> > >> > > > > >> >> > > > updated, and report to master.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
>> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
>> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
>> > >> > > > > >> >> > > >
>> org.apache.hadoop.hbase.regionserver.SplitRequest:
>> > >> Region
>> > >> > > > > split,
>> > >> > > > > >> META
>> > >> > > > > >> >> > > > updated, and report to master.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
>> > >> > > > > >> >> > > > new regions:
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> >
>> > >> > > > > >>
>> > >> > > >
>> > >> >
>> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
>> > 17hrs,
>> > >> > > 14mins,
>> > >> > > > > 2sec
>> > >> > > > > >> >> > > > ----------------------
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > And the master finally logs this:
>> > >> > > > > >> >> > > > -----------------------
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
>> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> Region
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on
>> server
>> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
>> > failed
>> > >> > > > > processing
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
>> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > Received
>> > >> > > > > SPLIT
>> > >> > > > > >> for
>> > >> > > > > >> >> > > region
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279
>> but it
>> > >> > > doesn't
>> > >> > > > > exist
>> > >> > > > > >> >> > > anymore,
>> > >> > > > > >> >> > > > probably already processed its split
>> > >> > > > > >> >> > > > ------------------------
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > I can't find any evidence that the region left
>> the
>> > >> node
>> > >> > > > during
>> > >> > > > > >> that
>> > >> > > > > >> >> > time.
>> > >> > > > > >> >> > > > I'll have to catch it in action next time and see
>> > what
>> > >> > the
>> > >> > > > > region
>> > >> > > > > >> is
>> > >> > > > > >> >> up
>> > >> > > > > >> >> > > to
>> > >> > > > > >> >> > > > during the problem period.
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > What does it mean that it successfully
>> transitioned
>> > >> from
>> > >> > > > SPLIT
>> > >> > > > > to
>> > >> > > > > >> >> > SPLIT?
>> > >> > > > > >> >> > > > Is that a valid transition?
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > Matt
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
>> > >> > > yuzhihong@gmail.com>
>> > >> > > > > >> wrote:
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > > > > Matt:
>> > >> > > > > >> >> > > > > From the following we can see that region
>> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
>> > >> > > > > >> >> > > > > cannot be found:
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
>> > >> > > > > >> >> > > > >
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > > Handling
>> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
>> > >> > > > > >> >> > > > >
>> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
>> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
>> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
>> > >> > > > > >> >> > > > >
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > Region
>> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
>> > >> server
>> > >> > > > > >> >> > > > >
>> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
>> > >> > > > > >> >> > > > > processing
>> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
>> > >> > > > > >> >> > > > >
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> > >> > > Received
>> > >> > > > > SPLIT
>> > >> > > > > >> >> for
>> > >> > > > > >> >> > > > region
>> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279
>> but
>> > it
>> > >> > > > doesn't
>> > >> > > > > >> exist
>> > >> > > > > >> >> > > > anymore,
>> > >> > > > > >> >> > > > > probably already processed its split
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > Thanks
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
>> > >> > > > > >> mcorgan@hotpads.com>
>> > >> > > > > >> >> > > wrote:
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from
>> > >> yesterday.
>> > >> > > >  It's
>> > >> > > > > not
>> > >> > > > > >> >> > > > correlated
>> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but
>> there's
>> > >> 3GB
>> > >> > of
>> > >> > > > > this
>> > >> > > > > >> from
>> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with
>> importing
>> > >> data
>> > >> > so
>> > >> > > > > >> testing
>> > >> > > > > >> >> the
>> > >> > > > > >> >> > > > split
>> > >> > > > > >> >> > > > > > code harder than normal.  The regions are
>> > 500-1GB
>> > >> > > > gzipped.
>> > >> > > > >  I
>> > >> > > > > >> can
>> > >> > > > > >> >> > > look
>> > >> > > > > >> >> > > > > into
>> > >> > > > > >> >> > > > > > it more but trying to figure out what to look
>> > for.
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > Thanks Ted,
>> > >> > > > > >> >> > > > > > Matt
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
>> > >> > > > > yuzhihong@gmail.com>
>> > >> > > > > >> >> > wrote:
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > > > > Matt:
>> > >> > > > > >> >> > > > > > > This is the method which made the logging:
>> > >> > > > > >> >> > > > > > >   private static int
>> > >> > tickleNodeSplit(ZooKeeperWatcher
>> > >> > > > > zkw,
>> > >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
>> > >> > HRegionInfo
>> > >> > > b,
>> > >> > > > > >> >> > ServerName
>> > >> > > > > >> >> > > > > > > serverName,
>> > >> > > > > >> >> > > > > > >       final int znodeVersion)
>> > >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
>> > >> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a,
>> > b);
>> > >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
>> > parent,
>> > >> > > > > serverName,
>> > >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
>> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
>> > >> > > > > >> >> > > > > > >       znodeVersion, payload);
>> > >> > > > > >> >> > > > > > >   }
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit()
>> > when
>> > >> > > waiting
>> > >> > > > > for
>> > >> > > > > >> >> > master
>> > >> > > > > >> >> > > to
>> > >> > > > > >> >> > > > > > split
>> > >> > > > > >> >> > > > > > > the region. Obviously something caused the
>> > >> master
>> > >> > not
>> > >> > > > > able
>> > >> > > > > >> to
>> > >> > > > > >> >> > > split.
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > How large is the region ?
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > Can you pastebin master log for that
>> period of
>> > >> > time ?
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > Thanks
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt
>> Corgan <
>> > >> > > > > >> >> mcorgan@hotpads.com
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> > > > > wrote:
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week
>> > and
>> > >> > have
>> > >> > > > > >> started to
>> > >> > > > > >> >> > > > > encounter
>> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on
>> > splits.
>> > >> > I'm
>> > >> > > > not
>> > >> > > > > >> sure
>> > >> > > > > >> >> > yet
>> > >> > > > > >> >> > > if
>> > >> > > > > >> >> > > > > > it's
>> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
>> > >> Solution
>> > >> > > so
>> > >> > > > > far
>> > >> > > > > >> has
>> > >> > > > > >> >> > been
>> > >> > > > > >> >> > > > to
>> > >> > > > > >> >> > > > > > > > restart the regionserver process.
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > log snippet:
>> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > It's repeating these two lines:
>> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
>> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
>> > >> Attempting
>> > >> > to
>> > >> > > > > >> transition
>> > >> > > > > >> >> > > node
>> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
>> > >> > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> to
>> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
>> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
>> > >> Successfully
>> > >> > > > > >> >> transitioned
>> > >> > > > > >> >> > > node
>> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
>> > >> > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> to
>> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > with the occasional:
>> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> > >> > > > > >> Still
>> > >> > > > > >> >> > > > waiting
>> > >> > > > > >> >> > > > > on
>> > >> > > > > >> >> > > > > > > the
>> > >> > > > > >> >> > > > > > > > master to process the split for
>> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > Should the region transition from
>> > >> > > RS_ZK_REGION_SPLIT
>> > >> > > > to
>> > >> > > > > >> >> itself?
>> > >> > > > > >> >> > >  It
>> > >> > > > > >> >> > > > > > looks
>> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the
>> code at
>> > >> all.
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > > > Thanks,
>> > >> > > > > >> >> > > > > > > > Matt
>> > >> > > > > >> >> > > > > > > >
>> > >> > > > > >> >> > > > > > >
>> > >> > > > > >> >> > > > > >
>> > >> > > > > >> >> > > > >
>> > >> > > > > >> >> > > >
>> > >> > > > > >> >> > >
>> > >> > > > > >> >> >
>> > >> > > > > >> >>
>> > >> > > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by ramkrishna vasudevan <ra...@gmail.com>.

Thanks for the logs.
I found the reason.

The following steps happen
-> Initially the parent region P1 starts splitting.
-> The split is going on normally.
-> Another split starts at the same time for the same region P1. (Not sure
why this started).
-> Rollback happens seeing an already existing node.
-> This node gets deleted in rollback and nodeDeleted Event starts.
-> In nodeDeleted event the RIT for the region P1 gets deleted.
-> Because of this there is no region in RIT.
-> Now the first split gets over.  Here the problem is we try to transit
the node to SPLITTING to SPLIT. But the node even does not exist.
But we don take any action on this.  We think it is successful.
-> Even before HBASE-6854 this could have happened.  Will file a JIRA for
the same.

Regards
Ram
On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <mc...@hotpads.com> wrote:

> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
> regionserver in question (node 169), so i'm guessing that's
> 22f8fa73d8af837410ca270f344f6bb8's mommy.
>
> btw - you can see my balancer kick in 45 seconds later (runs every 10
> minutes) here, but so far i think that's coincidence:
> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
> BalanceSwitch=false
>
> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
> and found all this stuff about a failed split:
> http://pastebin.com/xtXMZ388 and
> an attempted rollback.  Looks like it errors out when it goes to put a node
> in ZK but it's already there.  I'm not familiar with what a good split log
> looks like, so i'll stop commenting for now...
>
>
> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
> > The log shows that the first time the region was transitioned to
> SPLITTING
> > even then it was not populated in the Master's memory.
> >
> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
> > ramkrishna.s.vasudevan@gmail.com> wrote:
> >
> > > Could you attach the master logs at this time
> > > 2012-11-05 00:24:55?
> > >
> > > Regards
> > > Ram
> > >
> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lhofhansl@yahoo.com
> > >wrote:
> > >
> > >> Took a brief look through all SPLIT related commits since 0.94.0...
> > Found
> > >> these:
> > >>
> > >> HBASE-6854 *
> > >> HBASE-6713
> > >> HBASE-6329 *
> > >>
> > >> HBASE-6088
> > >>
> > >> HBASE-5986
> > >> HBASE-6070 *
> > >>
> > >>
> > >> The ones marked with * are (IMHO) more likely to be related.
> > >>
> > >> -- Lars
> > >>
> > >> ________________________________
> > >> From: Matt Corgan <mc...@hotpads.com>
> > >> To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
> > >> Sent: Monday, November 5, 2012 9:28 PM
> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >>
> > >> Yeah - we were running .94.0 since it came out but never saw it there.
> > >> I'll keep trying to narrow it down.  The only harm it's causing is log
> > >> spam and failing to move daughters to a new regionserver, which are
> > >> definitely problems, but it's not bringing down the cluster.
> > >>
> > >>
> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com>
> > >> wrote:
> > >>
> > >> > So it seems you can repeat this to some extend in 0.94.2, but you
> have
> > >> > never seen this before?
> > >> >
> > >> >
> > >> > -- Lars
> > >> >
> > >> >
> > >> >
> > >> > ________________________________
> > >> >  From: Matt Corgan <mc...@hotpads.com>
> > >> > To: dev <de...@hbase.apache.org>
> > >> > Sent: Monday, November 5, 2012 9:10 PM
> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >> >
> > >> > It happened in this new table that I have all the logs for.  The
> > region
> > >> in
> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the
> first
> > >> trace
> > >> > of this region is on the regionserver with the "Intantiated
> > >> tableName..."
> > >> > message
> > >> >
> > >> > 2012-11-05 22:24:21,162 DEBUG
> > >> org.apache.hadoop.hbase.regionserver.HRegion:
> > >> > Instantiated
> > >> >
> > >> >
> > >>
> >
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> > >> > 9784c21cbc2.
> > >> >
> > >> > I also know this region came from a recent split, but I can't find
> any
> > >> log
> > >> > messages show the parent finishing the split that created this
> > daughter
> > >> > region.  So my guess now is that the split is actually finishing and
> > >> > letting clients continue to write data, but something is failing to
> > >> print
> > >> > the log line and correctly tell the master about the new region.
> > >> >
> > >> > I've noticed that these regions are showing up on the clients in
> calls
> > >> to
> > >> > HTable.getRegionLocations(), so the clients continue to function,
> but
> > >> if I
> > >> > call HBaseAdmin.move() i get an UnknownRegionException.
> > >> >
> > >> >
> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:
> > >> >
> > >> > > I think what Matt encountered was this:
> > >> > > https://issues.apache.org/jira/browse/HBASE-7101
> > >> > >
> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com>
> > >> wrote:
> > >> > >
> > >> > > > Beats me - i can't find any record of it before that.
> > >> > > >
> > >> > > > I'm importing data into another table now.  I disabled/enabled
> the
> > >> > table
> > >> > > > first to make sure we have the original 4 region locations
> logged
> > >> > > > everywhere.  Will report back...
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> > >> > jdcryans@apache.org
> > >> > > > >wrote:
> > >> > > >
> > >> > > > > Wow, and can you figure how this happened?
> > >> > > > >
> > >> > > > > 2012-11-05 00:24:54,538 DEBUG
> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > >> > > > >
> > >> > > > > Split? Else how did the master told that RS to open a region
> > that
> > >> > > doesn't
> > >> > > > > exist?
> > >> > > > >
> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
> > mcorgan@hotpads.com>
> > >> > > wrote:
> > >> > > > > > Here's one from last night with the master logs at the
> bottom.
> > >> > > > > > http://pastebin.com/cSdMbA2a
> > >> > > > > >
> > >> > > > > > I don't see that region in the master logs for 4 days back.
> >  I'm
> > >> > > going
> > >> > > > to
> > >> > > > > > import some new data from scratch soon and will be sure to
> > keep
> > >> all
> > >> > > the
> > >> > > > > > master logs.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > >> > > > jdcryans@apache.org
> > >> > > > > >wrote:
> > >> > > > > >
> > >> > > > > >> This reminds me a lot of
> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
> > >> > > > > >>
> > >> > > > > >> What I'd like to see is the log from the first time the
> > master
> > >> > > > > >> receives the split message. I guess it says the region
> > doesn't
> > >> > exist
> > >> > > > > >> anymore because the split was processed already in the
> master
> > >> but
> > >> > > > > >> there's a failure mode similar to 4792.
> > >> > > > > >>
> > >> > > > > >> I saw this on another cluster last week but the logs were
> > >> rolling
> > >> > > > > >> based on size so the original split message was lost.
> > >> > > > > >>
> > >> > > > > >> J-D
> > >> > > > > >>
> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
> > >> mcorgan@hotpads.com>
> > >> > > > > wrote:
> > >> > > > > >> > I've had a few successful splits today without any
> errors.
> > >> I'll
> > >> > > > turn
> > >> > > > > up
> > >> > > > > >> > the importer speed tomorrow to start stressing the
> cluster
> > >> more.
> > >> > > > > >> >
> > >> > > > > >> >
> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > >> > > wrote:
> > >> > > > > >> >
> > >> > > > > >> >> Matt:
> > >> > > > > >> >> Was there any region server crash before you noticed the
> > >> > repeated
> > >> > > > log
> > >> > > > > >> for
> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > >> > > > > >> >>
> > >> > > > > >> >> I wonder if the following JIRA might be related:
> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
> > missing
> > >> > > > daughter
> > >> > > > > >> >>
> > >> > > > > >> >> Thanks
> > >> > > > > >> >>
> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> > >> > mcorgan@hotpads.com
> > >> > > >
> > >> > > > > >> wrote:
> > >> > > > > >> >>
> > >> > > > > >> >> > Strangely, I don't see any record of that region in
> the
> > >> > master
> > >> > > > > before
> > >> > > > > >> >> what
> > >> > > > > >> >> > I already pasted even though I have logs back to
> 10/30.
> > >> Next
> > >> > > > time
> > >> > > > > it
> > >> > > > > >> >> > happens I'll gather a full log record and try to debug
> > >> while
> > >> > > it's
> > >> > > > > >> >> > occurring.
> > >> > > > > >> >> >
> > >> > > > > >> >> >
> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
> > chintaguntla <
> > >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > >> > > > > >> >> >
> > >> > > > > >> >> > > Hi Matt,
> > >> > > > > >> >> > > can you paste some more master logs of region
> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > >> > > > > >> >> > > I think Its not problem with splitting.
> > >> > > > > >> >> > > We are getting
> > >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not found
> > on
> > >> > > server
> > >> > > > "
> > >> > > > > +
> > >> > > > > >> >> > > serverName +
> > >> > > > > >> >> > >         "; failed processing");
> > >> > > > > >> >> > > this log means no entry in servers map(not fully
> > >> assigned).
> > >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > >> > > > > >> >> > >     HRegionInfo foundHri = null;
> > >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
> > >> > > > > >> >> > >       if (hri.getEncodedName().equals(encodedName))
> {
> > >> > > > > >> >> > >         foundHri = hri;
> > >> > > > > >> >> > >         break;
> > >> > > > > >> >> > >       }
> > >> > > > > >> >> > >     }
> > >> > > > > >> >> > >     return foundHri;
> > >> > > > > >> >> > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > >> > > > > lhofhansl@yahoo.com>
> > >> > > > > >> >> > wrote:
> > >> > > > > >> >> > >
> > >> > > > > >> >> > > > CC'ing dev list...
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > Is anybody aware of any changes that went in
> > recently
> > >> > that
> > >> > > > > could
> > >> > > > > >> >> cause
> > >> > > > > >> >> > > > this?
> > >> > > > > >> >> > > > I looked around a bit, but could not find anything
> > >> > obvious.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > -- Lars
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > ________________________________
> > >> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > >> > > > > >> >> > > > To: user <us...@hbase.apache.org>
> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > >> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT
> on
> > >> .94.2
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > I think the cluster is ok without running hbck, as
> > >> > > restarting
> > >> > > > > the
> > >> > > > > >> >> > > > regionserver process stops the warnings and
> > everything
> > >> > > looks
> > >> > > > ok
> > >> > > > > >> >> > > otherwise.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > here's the regionserver right after the split
> > happens:
> > >> > > > > >> >> > > > ------------------------
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > >> > > transition
> > >> > > > > node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > >> > > > transitioned
> > >> > > > > >> node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > >> > > > > >> >> > > >
> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > >> > > Still
> > >> > > > > >> waiting
> > >> > > > > >> >> on
> > >> > > > > >> >> > > the
> > >> > > > > >> >> > > > master to process the split for
> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > >> > > transition
> > >> > > > > node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > >> > > > transitioned
> > >> > > > > >> node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > -----------------------------
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > >> > > > > >> RS_ZK_REGION_SPLIT"
> > >> > > > > >> >> > > > continues for 15 or so hours and finally settles
> > >> without
> > >> > > > manual
> > >> > > > > >> >> > > > intervention with these regionserver log messages:
> > >> > > > > >> >> > > > -----------------------
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > >> > > transition
> > >> > > > > >> node *
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> > >> RS_ZK_REGION_SPLIT
> > >> > > to
> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > >> > > > transitioned
> > >> > > > > >> node *
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> > >> RS_ZK_REGION_SPLIT
> > >> > > to
> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> > >> Region
> > >> > > > > split,
> > >> > > > > >> META
> > >> > > > > >> >> > > > updated, and report to master.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> > >> Region
> > >> > > > > split,
> > >> > > > > >> META
> > >> > > > > >> >> > > > updated, and report to master.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > >> > > > > >> >> > > > new regions:
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > >> > > > > >> >> > > >
> > >> > > > > >> >> >
> > >> > > > > >>
> > >> > > >
> > >> >
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
> > 17hrs,
> > >> > > 14mins,
> > >> > > > > 2sec
> > >> > > > > >> >> > > > ----------------------
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > And the master finally logs this:
> > >> > > > > >> >> > > > -----------------------
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > >> Region
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on
> server
> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
> > failed
> > >> > > > > processing
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > Received
> > >> > > > > SPLIT
> > >> > > > > >> for
> > >> > > > > >> >> > > region
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but
> it
> > >> > > doesn't
> > >> > > > > exist
> > >> > > > > >> >> > > anymore,
> > >> > > > > >> >> > > > probably already processed its split
> > >> > > > > >> >> > > > ------------------------
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > I can't find any evidence that the region left the
> > >> node
> > >> > > > during
> > >> > > > > >> that
> > >> > > > > >> >> > time.
> > >> > > > > >> >> > > > I'll have to catch it in action next time and see
> > what
> > >> > the
> > >> > > > > region
> > >> > > > > >> is
> > >> > > > > >> >> up
> > >> > > > > >> >> > > to
> > >> > > > > >> >> > > > during the problem period.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > What does it mean that it successfully
> transitioned
> > >> from
> > >> > > > SPLIT
> > >> > > > > to
> > >> > > > > >> >> > SPLIT?
> > >> > > > > >> >> > > > Is that a valid transition?
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > Matt
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> > >> > > yuzhihong@gmail.com>
> > >> > > > > >> wrote:
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > > Matt:
> > >> > > > > >> >> > > > > From the following we can see that region
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > >> > > > > >> >> > > > > cannot be found:
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > >> > > > > >> >> > > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > > Handling
> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > >> > > > > >> >> > > > >
> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > >> > > > > >> >> > > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > Region
> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
> > >> server
> > >> > > > > >> >> > > > >
> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > >> > > > > >> >> > > > > processing
> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > >> > > > > >> >> > > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > > Received
> > >> > > > > SPLIT
> > >> > > > > >> >> for
> > >> > > > > >> >> > > > region
> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279
> but
> > it
> > >> > > > doesn't
> > >> > > > > >> exist
> > >> > > > > >> >> > > > anymore,
> > >> > > > > >> >> > > > > probably already processed its split
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > Thanks
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > >> > > > > >> mcorgan@hotpads.com>
> > >> > > > > >> >> > > wrote:
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from
> > >> yesterday.
> > >> > > >  It's
> > >> > > > > not
> > >> > > > > >> >> > > > correlated
> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but
> there's
> > >> 3GB
> > >> > of
> > >> > > > > this
> > >> > > > > >> from
> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with
> importing
> > >> data
> > >> > so
> > >> > > > > >> testing
> > >> > > > > >> >> the
> > >> > > > > >> >> > > > split
> > >> > > > > >> >> > > > > > code harder than normal.  The regions are
> > 500-1GB
> > >> > > > gzipped.
> > >> > > > >  I
> > >> > > > > >> can
> > >> > > > > >> >> > > look
> > >> > > > > >> >> > > > > into
> > >> > > > > >> >> > > > > > it more but trying to figure out what to look
> > for.
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > Thanks Ted,
> > >> > > > > >> >> > > > > > Matt
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > >> > > > > yuzhihong@gmail.com>
> > >> > > > > >> >> > wrote:
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > > Matt:
> > >> > > > > >> >> > > > > > > This is the method which made the logging:
> > >> > > > > >> >> > > > > > >   private static int
> > >> > tickleNodeSplit(ZooKeeperWatcher
> > >> > > > > zkw,
> > >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> > >> > HRegionInfo
> > >> > > b,
> > >> > > > > >> >> > ServerName
> > >> > > > > >> >> > > > > > > serverName,
> > >> > > > > >> >> > > > > > >       final int znodeVersion)
> > >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
> > >> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a,
> > b);
> > >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
> > parent,
> > >> > > > > serverName,
> > >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > >> > > > > >> >> > > > > > >       znodeVersion, payload);
> > >> > > > > >> >> > > > > > >   }
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit()
> > when
> > >> > > waiting
> > >> > > > > for
> > >> > > > > >> >> > master
> > >> > > > > >> >> > > to
> > >> > > > > >> >> > > > > > split
> > >> > > > > >> >> > > > > > > the region. Obviously something caused the
> > >> master
> > >> > not
> > >> > > > > able
> > >> > > > > >> to
> > >> > > > > >> >> > > split.
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > How large is the region ?
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > Can you pastebin master log for that period
> of
> > >> > time ?
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > Thanks
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan
> <
> > >> > > > > >> >> mcorgan@hotpads.com
> > >> > > > > >> >> > >
> > >> > > > > >> >> > > > > wrote:
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week
> > and
> > >> > have
> > >> > > > > >> started to
> > >> > > > > >> >> > > > > encounter
> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on
> > splits.
> > >> > I'm
> > >> > > > not
> > >> > > > > >> sure
> > >> > > > > >> >> > yet
> > >> > > > > >> >> > > if
> > >> > > > > >> >> > > > > > it's
> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
> > >> Solution
> > >> > > so
> > >> > > > > far
> > >> > > > > >> has
> > >> > > > > >> >> > been
> > >> > > > > >> >> > > > to
> > >> > > > > >> >> > > > > > > > restart the regionserver process.
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > log snippet:
> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > It's repeating these two lines:
> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> > >> Attempting
> > >> > to
> > >> > > > > >> transition
> > >> > > > > >> >> > > node
> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > >> > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> to
> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> > >> Successfully
> > >> > > > > >> >> transitioned
> > >> > > > > >> >> > > node
> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > >> > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> to
> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > with the occasional:
> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > >> > > > > >> >> > > > > > > >
> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > >> > > > > >> Still
> > >> > > > > >> >> > > > waiting
> > >> > > > > >> >> > > > > on
> > >> > > > > >> >> > > > > > > the
> > >> > > > > >> >> > > > > > > > master to process the split for
> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > Should the region transition from
> > >> > > RS_ZK_REGION_SPLIT
> > >> > > > to
> > >> > > > > >> >> itself?
> > >> > > > > >> >> > >  It
> > >> > > > > >> >> > > > > > looks
> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the code
> at
> > >> all.
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > Thanks,
> > >> > > > > >> >> > > > > > > > Matt
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
regionserver in question (node 169), so i'm guessing that's
22f8fa73d8af837410ca270f344f6bb8's mommy.

btw - you can see my balancer kick in 45 seconds later (runs every 10
minutes) here, but so far i think that's coincidence:
2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
BalanceSwitch=false

I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
and found all this stuff about a failed split: http://pastebin.com/xtXMZ388 and
an attempted rollback.  Looks like it errors out when it goes to put a node
in ZK but it's already there.  I'm not familiar with what a good split log
looks like, so i'll stop commenting for now...


On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> The log shows that the first time the region was transitioned to SPLITTING
> even then it was not populated in the Master's memory.
>
> On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
> > Could you attach the master logs at this time
> > 2012-11-05 00:24:55?
> >
> > Regards
> > Ram
> >
> > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lhofhansl@yahoo.com
> >wrote:
> >
> >> Took a brief look through all SPLIT related commits since 0.94.0...
> Found
> >> these:
> >>
> >> HBASE-6854 *
> >> HBASE-6713
> >> HBASE-6329 *
> >>
> >> HBASE-6088
> >>
> >> HBASE-5986
> >> HBASE-6070 *
> >>
> >>
> >> The ones marked with * are (IMHO) more likely to be related.
> >>
> >> -- Lars
> >>
> >> ________________________________
> >> From: Matt Corgan <mc...@hotpads.com>
> >> To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
> >> Sent: Monday, November 5, 2012 9:28 PM
> >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >>
> >> Yeah - we were running .94.0 since it came out but never saw it there.
> >> I'll keep trying to narrow it down.  The only harm it's causing is log
> >> spam and failing to move daughters to a new regionserver, which are
> >> definitely problems, but it's not bringing down the cluster.
> >>
> >>
> >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com>
> >> wrote:
> >>
> >> > So it seems you can repeat this to some extend in 0.94.2, but you have
> >> > never seen this before?
> >> >
> >> >
> >> > -- Lars
> >> >
> >> >
> >> >
> >> > ________________________________
> >> >  From: Matt Corgan <mc...@hotpads.com>
> >> > To: dev <de...@hbase.apache.org>
> >> > Sent: Monday, November 5, 2012 9:10 PM
> >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> >
> >> > It happened in this new table that I have all the logs for.  The
> region
> >> in
> >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first
> >> trace
> >> > of this region is on the regionserver with the "Intantiated
> >> tableName..."
> >> > message
> >> >
> >> > 2012-11-05 22:24:21,162 DEBUG
> >> org.apache.hadoop.hbase.regionserver.HRegion:
> >> > Instantiated
> >> >
> >> >
> >>
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> >> > 9784c21cbc2.
> >> >
> >> > I also know this region came from a recent split, but I can't find any
> >> log
> >> > messages show the parent finishing the split that created this
> daughter
> >> > region.  So my guess now is that the split is actually finishing and
> >> > letting clients continue to write data, but something is failing to
> >> print
> >> > the log line and correctly tell the master about the new region.
> >> >
> >> > I've noticed that these regions are showing up on the clients in calls
> >> to
> >> > HTable.getRegionLocations(), so the clients continue to function, but
> >> if I
> >> > call HBaseAdmin.move() i get an UnknownRegionException.
> >> >
> >> >
> >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >> >
> >> > > I think what Matt encountered was this:
> >> > > https://issues.apache.org/jira/browse/HBASE-7101
> >> > >
> >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com>
> >> wrote:
> >> > >
> >> > > > Beats me - i can't find any record of it before that.
> >> > > >
> >> > > > I'm importing data into another table now.  I disabled/enabled the
> >> > table
> >> > > > first to make sure we have the original 4 region locations logged
> >> > > > everywhere.  Will report back...
> >> > > >
> >> > > >
> >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> >> > jdcryans@apache.org
> >> > > > >wrote:
> >> > > >
> >> > > > > Wow, and can you figure how this happened?
> >> > > > >
> >> > > > > 2012-11-05 00:24:54,538 DEBUG
> >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> >> > > > >
> >> > > > > Split? Else how did the master told that RS to open a region
> that
> >> > > doesn't
> >> > > > > exist?
> >> > > > >
> >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
> mcorgan@hotpads.com>
> >> > > wrote:
> >> > > > > > Here's one from last night with the master logs at the bottom.
> >> > > > > > http://pastebin.com/cSdMbA2a
> >> > > > > >
> >> > > > > > I don't see that region in the master logs for 4 days back.
>  I'm
> >> > > going
> >> > > > to
> >> > > > > > import some new data from scratch soon and will be sure to
> keep
> >> all
> >> > > the
> >> > > > > > master logs.
> >> > > > > >
> >> > > > > >
> >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> >> > > > jdcryans@apache.org
> >> > > > > >wrote:
> >> > > > > >
> >> > > > > >> This reminds me a lot of
> >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
> >> > > > > >>
> >> > > > > >> What I'd like to see is the log from the first time the
> master
> >> > > > > >> receives the split message. I guess it says the region
> doesn't
> >> > exist
> >> > > > > >> anymore because the split was processed already in the master
> >> but
> >> > > > > >> there's a failure mode similar to 4792.
> >> > > > > >>
> >> > > > > >> I saw this on another cluster last week but the logs were
> >> rolling
> >> > > > > >> based on size so the original split message was lost.
> >> > > > > >>
> >> > > > > >> J-D
> >> > > > > >>
> >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
> >> mcorgan@hotpads.com>
> >> > > > > wrote:
> >> > > > > >> > I've had a few successful splits today without any errors.
> >> I'll
> >> > > > turn
> >> > > > > up
> >> > > > > >> > the importer speed tomorrow to start stressing the cluster
> >> more.
> >> > > > > >> >
> >> > > > > >> >
> >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
> yuzhihong@gmail.com>
> >> > > wrote:
> >> > > > > >> >
> >> > > > > >> >> Matt:
> >> > > > > >> >> Was there any region server crash before you noticed the
> >> > repeated
> >> > > > log
> >> > > > > >> for
> >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> >> > > > > >> >>
> >> > > > > >> >> I wonder if the following JIRA might be related:
> >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
> missing
> >> > > > daughter
> >> > > > > >> >>
> >> > > > > >> >> Thanks
> >> > > > > >> >>
> >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> >> > mcorgan@hotpads.com
> >> > > >
> >> > > > > >> wrote:
> >> > > > > >> >>
> >> > > > > >> >> > Strangely, I don't see any record of that region in the
> >> > master
> >> > > > > before
> >> > > > > >> >> what
> >> > > > > >> >> > I already pasted even though I have logs back to 10/30.
> >> Next
> >> > > > time
> >> > > > > it
> >> > > > > >> >> > happens I'll gather a full log record and try to debug
> >> while
> >> > > it's
> >> > > > > >> >> > occurring.
> >> > > > > >> >> >
> >> > > > > >> >> >
> >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
> chintaguntla <
> >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> >> > > > > >> >> >
> >> > > > > >> >> > > Hi Matt,
> >> > > > > >> >> > > can you paste some more master logs of region
> >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> >> > > > > >> >> > > I think Its not problem with splitting.
> >> > > > > >> >> > > We are getting
> >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not found
> on
> >> > > server
> >> > > > "
> >> > > > > +
> >> > > > > >> >> > > serverName +
> >> > > > > >> >> > >         "; failed processing");
> >> > > > > >> >> > > this log means no entry in servers map(not fully
> >> assigned).
> >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> >> > > > > >> >> > >     HRegionInfo foundHri = null;
> >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
> >> > > > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> >> > > > > >> >> > >         foundHri = hri;
> >> > > > > >> >> > >         break;
> >> > > > > >> >> > >       }
> >> > > > > >> >> > >     }
> >> > > > > >> >> > >     return foundHri;
> >> > > > > >> >> > >
> >> > > > > >> >> > >
> >> > > > > >> >> > >
> >> > > > > >> >> > >
> >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> >> > > > > lhofhansl@yahoo.com>
> >> > > > > >> >> > wrote:
> >> > > > > >> >> > >
> >> > > > > >> >> > > > CC'ing dev list...
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > Is anybody aware of any changes that went in
> recently
> >> > that
> >> > > > > could
> >> > > > > >> >> cause
> >> > > > > >> >> > > > this?
> >> > > > > >> >> > > > I looked around a bit, but could not find anything
> >> > obvious.
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > -- Lars
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > ________________________________
> >> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> >> > > > > >> >> > > > To: user <us...@hbase.apache.org>
> >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> >> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on
> >> .94.2
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > I think the cluster is ok without running hbck, as
> >> > > restarting
> >> > > > > the
> >> > > > > >> >> > > > regionserver process stops the warnings and
> everything
> >> > > looks
> >> > > > ok
> >> > > > > >> >> > > otherwise.
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > here's the regionserver right after the split
> happens:
> >> > > > > >> >> > > > ------------------------
> >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> >> > > transition
> >> > > > > node
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> >> > > > transitioned
> >> > > > > >> node
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > > > > >> >> > > >
> org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> > > Still
> >> > > > > >> waiting
> >> > > > > >> >> on
> >> > > > > >> >> > > the
> >> > > > > >> >> > > > master to process the split for
> >> > > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> >> > > transition
> >> > > > > node
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> >> > > > transitioned
> >> > > > > >> node
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > > -----------------------------
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> >> > > > > >> RS_ZK_REGION_SPLIT"
> >> > > > > >> >> > > > continues for 15 or so hours and finally settles
> >> without
> >> > > > manual
> >> > > > > >> >> > > > intervention with these regionserver log messages:
> >> > > > > >> >> > > > -----------------------
> >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> >> > > transition
> >> > > > > >> node *
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> >> RS_ZK_REGION_SPLIT
> >> > > to
> >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> >> > > > transitioned
> >> > > > > >> node *
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> >> RS_ZK_REGION_SPLIT
> >> > > to
> >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> >> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> >> Region
> >> > > > > split,
> >> > > > > >> META
> >> > > > > >> >> > > > updated, and report to master.
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > >
> >> > > > > >> >> >
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > >
> >> > > > > >> >> >
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > >
> >> > > > > >> >> >
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> >> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> >> Region
> >> > > > > split,
> >> > > > > >> META
> >> > > > > >> >> > > > updated, and report to master.
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > >
> >> > > > > >> >> >
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> >> > > > > >> >> > > > new regions:
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > >
> >> > > > > >> >> >
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> >> > > > > >> >> > > >
> >> > > > > >> >> >
> >> > > > > >>
> >> > > >
> >> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
> 17hrs,
> >> > > 14mins,
> >> > > > > 2sec
> >> > > > > >> >> > > > ----------------------
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > And the master finally logs this:
> >> > > > > >> >> > > > -----------------------
> >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> >> Region
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
> failed
> >> > > > > processing
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> >> > Received
> >> > > > > SPLIT
> >> > > > > >> for
> >> > > > > >> >> > > region
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> >> > > doesn't
> >> > > > > exist
> >> > > > > >> >> > > anymore,
> >> > > > > >> >> > > > probably already processed its split
> >> > > > > >> >> > > > ------------------------
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > I can't find any evidence that the region left the
> >> node
> >> > > > during
> >> > > > > >> that
> >> > > > > >> >> > time.
> >> > > > > >> >> > > > I'll have to catch it in action next time and see
> what
> >> > the
> >> > > > > region
> >> > > > > >> is
> >> > > > > >> >> up
> >> > > > > >> >> > > to
> >> > > > > >> >> > > > during the problem period.
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > What does it mean that it successfully transitioned
> >> from
> >> > > > SPLIT
> >> > > > > to
> >> > > > > >> >> > SPLIT?
> >> > > > > >> >> > > > Is that a valid transition?
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > Matt
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> >> > > yuzhihong@gmail.com>
> >> > > > > >> wrote:
> >> > > > > >> >> > > >
> >> > > > > >> >> > > > > Matt:
> >> > > > > >> >> > > > > From the following we can see that region
> >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > > > > >> >> > > > > cannot be found:
> >> > > > > >> >> > > > >
> >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> >> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> >> > > Handling
> >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> >> > > > > >> >> > > > >
> >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> >> > Region
> >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
> >> server
> >> > > > > >> >> > > > >
> >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> >> > > > > >> >> > > > > processing
> >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> >> > > Received
> >> > > > > SPLIT
> >> > > > > >> >> for
> >> > > > > >> >> > > > region
> >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but
> it
> >> > > > doesn't
> >> > > > > >> exist
> >> > > > > >> >> > > > anymore,
> >> > > > > >> >> > > > > probably already processed its split
> >> > > > > >> >> > > > >
> >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
> >> > > > > >> >> > > > >
> >> > > > > >> >> > > > > Thanks
> >> > > > > >> >> > > > >
> >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> >> > > > > >> mcorgan@hotpads.com>
> >> > > > > >> >> > > wrote:
> >> > > > > >> >> > > > >
> >> > > > > >> >> > > > > > Here's a sample of the master's logs from
> >> yesterday.
> >> > > >  It's
> >> > > > > not
> >> > > > > >> >> > > > correlated
> >> > > > > >> >> > > > > > exactly with the other pastebin log, but there's
> >> 3GB
> >> > of
> >> > > > > this
> >> > > > > >> from
> >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> >> > > > > >> >> > > > > >
> >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with importing
> >> data
> >> > so
> >> > > > > >> testing
> >> > > > > >> >> the
> >> > > > > >> >> > > > split
> >> > > > > >> >> > > > > > code harder than normal.  The regions are
> 500-1GB
> >> > > > gzipped.
> >> > > > >  I
> >> > > > > >> can
> >> > > > > >> >> > > look
> >> > > > > >> >> > > > > into
> >> > > > > >> >> > > > > > it more but trying to figure out what to look
> for.
> >> > > > > >> >> > > > > >
> >> > > > > >> >> > > > > > Thanks Ted,
> >> > > > > >> >> > > > > > Matt
> >> > > > > >> >> > > > > >
> >> > > > > >> >> > > > > >
> >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> >> > > > > yuzhihong@gmail.com>
> >> > > > > >> >> > wrote:
> >> > > > > >> >> > > > > >
> >> > > > > >> >> > > > > > > Matt:
> >> > > > > >> >> > > > > > > This is the method which made the logging:
> >> > > > > >> >> > > > > > >   private static int
> >> > tickleNodeSplit(ZooKeeperWatcher
> >> > > > > zkw,
> >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> >> > HRegionInfo
> >> > > b,
> >> > > > > >> >> > ServerName
> >> > > > > >> >> > > > > > > serverName,
> >> > > > > >> >> > > > > > >       final int znodeVersion)
> >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
> >> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a,
> b);
> >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
> parent,
> >> > > > > serverName,
> >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> >> > > > > >> >> > > > > > >       znodeVersion, payload);
> >> > > > > >> >> > > > > > >   }
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit()
> when
> >> > > waiting
> >> > > > > for
> >> > > > > >> >> > master
> >> > > > > >> >> > > to
> >> > > > > >> >> > > > > > split
> >> > > > > >> >> > > > > > > the region. Obviously something caused the
> >> master
> >> > not
> >> > > > > able
> >> > > > > >> to
> >> > > > > >> >> > > split.
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > > > How large is the region ?
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > > > Can you pastebin master log for that period of
> >> > time ?
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > > > Thanks
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> >> > > > > >> >> mcorgan@hotpads.com
> >> > > > > >> >> > >
> >> > > > > >> >> > > > > wrote:
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week
> and
> >> > have
> >> > > > > >> started to
> >> > > > > >> >> > > > > encounter
> >> > > > > >> >> > > > > > > > infinite loops of region-transition on
> splits.
> >> > I'm
> >> > > > not
> >> > > > > >> sure
> >> > > > > >> >> > yet
> >> > > > > >> >> > > if
> >> > > > > >> >> > > > > > it's
> >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
> >> Solution
> >> > > so
> >> > > > > far
> >> > > > > >> has
> >> > > > > >> >> > been
> >> > > > > >> >> > > > to
> >> > > > > >> >> > > > > > > > restart the regionserver process.
> >> > > > > >> >> > > > > > > >
> >> > > > > >> >> > > > > > > > log snippet:
> >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> >> > > > > >> >> > > > > > > >
> >> > > > > >> >> > > > > > > > It's repeating these two lines:
> >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> >> Attempting
> >> > to
> >> > > > > >> transition
> >> > > > > >> >> > > node
> >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> >> > > > > RS_ZK_REGION_SPLIT
> >> > > > > >> to
> >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> >> Successfully
> >> > > > > >> >> transitioned
> >> > > > > >> >> > > node
> >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> >> > > > > RS_ZK_REGION_SPLIT
> >> > > > > >> to
> >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > > > > >> >> > > > > > > >
> >> > > > > >> >> > > > > > > > with the occasional:
> >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> >> > > > > >> >> > > > > > > >
> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> > > > > >> Still
> >> > > > > >> >> > > > waiting
> >> > > > > >> >> > > > > on
> >> > > > > >> >> > > > > > > the
> >> > > > > >> >> > > > > > > > master to process the split for
> >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> >> > > > > >> >> > > > > > > >
> >> > > > > >> >> > > > > > > > Should the region transition from
> >> > > RS_ZK_REGION_SPLIT
> >> > > > to
> >> > > > > >> >> itself?
> >> > > > > >> >> > >  It
> >> > > > > >> >> > > > > > looks
> >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the code at
> >> all.
> >> > > > > >> >> > > > > > > >
> >> > > > > >> >> > > > > > > > Thanks,
> >> > > > > >> >> > > > > > > > Matt
> >> > > > > >> >> > > > > > > >
> >> > > > > >> >> > > > > > >
> >> > > > > >> >> > > > > >
> >> > > > > >> >> > > > >
> >> > > > > >> >> > > >
> >> > > > > >> >> > >
> >> > > > > >> >> >
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by ramkrishna vasudevan <ra...@gmail.com>.

The log shows that the first time the region was transitioned to SPLITTING
even then it was not populated in the Master's memory.

On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> Could you attach the master logs at this time
> 2012-11-05 00:24:55?
>
> Regards
> Ram
>
> On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lh...@yahoo.com>wrote:
>
>> Took a brief look through all SPLIT related commits since 0.94.0... Found
>> these:
>>
>> HBASE-6854 *
>> HBASE-6713
>> HBASE-6329 *
>>
>> HBASE-6088
>>
>> HBASE-5986
>> HBASE-6070 *
>>
>>
>> The ones marked with * are (IMHO) more likely to be related.
>>
>> -- Lars
>>
>> ________________________________
>> From: Matt Corgan <mc...@hotpads.com>
>> To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
>> Sent: Monday, November 5, 2012 9:28 PM
>> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>>
>> Yeah - we were running .94.0 since it came out but never saw it there.
>> I'll keep trying to narrow it down.  The only harm it's causing is log
>> spam and failing to move daughters to a new regionserver, which are
>> definitely problems, but it's not bringing down the cluster.
>>
>>
>> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com>
>> wrote:
>>
>> > So it seems you can repeat this to some extend in 0.94.2, but you have
>> > never seen this before?
>> >
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ________________________________
>> >  From: Matt Corgan <mc...@hotpads.com>
>> > To: dev <de...@hbase.apache.org>
>> > Sent: Monday, November 5, 2012 9:10 PM
>> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> >
>> > It happened in this new table that I have all the logs for.  The region
>> in
>> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first
>> trace
>> > of this region is on the regionserver with the "Intantiated
>> tableName..."
>> > message
>> >
>> > 2012-11-05 22:24:21,162 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> > Instantiated
>> >
>> >
>> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
>> > 9784c21cbc2.
>> >
>> > I also know this region came from a recent split, but I can't find any
>> log
>> > messages show the parent finishing the split that created this daughter
>> > region.  So my guess now is that the split is actually finishing and
>> > letting clients continue to write data, but something is failing to
>> print
>> > the log line and correctly tell the master about the new region.
>> >
>> > I've noticed that these regions are showing up on the clients in calls
>> to
>> > HTable.getRegionLocations(), so the clients continue to function, but
>> if I
>> > call HBaseAdmin.move() i get an UnknownRegionException.
>> >
>> >
>> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:
>> >
>> > > I think what Matt encountered was this:
>> > > https://issues.apache.org/jira/browse/HBASE-7101
>> > >
>> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com>
>> wrote:
>> > >
>> > > > Beats me - i can't find any record of it before that.
>> > > >
>> > > > I'm importing data into another table now.  I disabled/enabled the
>> > table
>> > > > first to make sure we have the original 4 region locations logged
>> > > > everywhere.  Will report back...
>> > > >
>> > > >
>> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
>> > jdcryans@apache.org
>> > > > >wrote:
>> > > >
>> > > > > Wow, and can you figure how this happened?
>> > > > >
>> > > > > 2012-11-05 00:24:54,538 DEBUG
>> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
>> > > > >
>> > > > > Split? Else how did the master told that RS to open a region that
>> > > doesn't
>> > > > > exist?
>> > > > >
>> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com>
>> > > wrote:
>> > > > > > Here's one from last night with the master logs at the bottom.
>> > > > > > http://pastebin.com/cSdMbA2a
>> > > > > >
>> > > > > > I don't see that region in the master logs for 4 days back.  I'm
>> > > going
>> > > > to
>> > > > > > import some new data from scratch soon and will be sure to keep
>> all
>> > > the
>> > > > > > master logs.
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
>> > > > jdcryans@apache.org
>> > > > > >wrote:
>> > > > > >
>> > > > > >> This reminds me a lot of
>> > > > > https://issues.apache.org/jira/browse/HBASE-4792
>> > > > > >>
>> > > > > >> What I'd like to see is the log from the first time the master
>> > > > > >> receives the split message. I guess it says the region doesn't
>> > exist
>> > > > > >> anymore because the split was processed already in the master
>> but
>> > > > > >> there's a failure mode similar to 4792.
>> > > > > >>
>> > > > > >> I saw this on another cluster last week but the logs were
>> rolling
>> > > > > >> based on size so the original split message was lost.
>> > > > > >>
>> > > > > >> J-D
>> > > > > >>
>> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
>> mcorgan@hotpads.com>
>> > > > > wrote:
>> > > > > >> > I've had a few successful splits today without any errors.
>> I'll
>> > > > turn
>> > > > > up
>> > > > > >> > the importer speed tomorrow to start stressing the cluster
>> more.
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com>
>> > > wrote:
>> > > > > >> >
>> > > > > >> >> Matt:
>> > > > > >> >> Was there any region server crash before you noticed the
>> > repeated
>> > > > log
>> > > > > >> for
>> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
>> > > > > >> >>
>> > > > > >> >> I wonder if the following JIRA might be related:
>> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
>> > > > daughter
>> > > > > >> >>
>> > > > > >> >> Thanks
>> > > > > >> >>
>> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
>> > mcorgan@hotpads.com
>> > > >
>> > > > > >> wrote:
>> > > > > >> >>
>> > > > > >> >> > Strangely, I don't see any record of that region in the
>> > master
>> > > > > before
>> > > > > >> >> what
>> > > > > >> >> > I already pasted even though I have logs back to 10/30.
>> Next
>> > > > time
>> > > > > it
>> > > > > >> >> > happens I'll gather a full log record and try to debug
>> while
>> > > it's
>> > > > > >> >> > occurring.
>> > > > > >> >> >
>> > > > > >> >> >
>> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
>> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
>> > > > > >> >> >
>> > > > > >> >> > > Hi Matt,
>> > > > > >> >> > > can you paste some more master logs of region
>> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
>> > > > > >> >> > > I think Its not problem with splitting.
>> > > > > >> >> > > We are getting
>> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not found on
>> > > server
>> > > > "
>> > > > > +
>> > > > > >> >> > > serverName +
>> > > > > >> >> > >         "; failed processing");
>> > > > > >> >> > > this log means no entry in servers map(not fully
>> assigned).
>> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
>> > > > > >> >> > >     HRegionInfo foundHri = null;
>> > > > > >> >> > >     for (HRegionInfo hri: hris) {
>> > > > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
>> > > > > >> >> > >         foundHri = hri;
>> > > > > >> >> > >         break;
>> > > > > >> >> > >       }
>> > > > > >> >> > >     }
>> > > > > >> >> > >     return foundHri;
>> > > > > >> >> > >
>> > > > > >> >> > >
>> > > > > >> >> > >
>> > > > > >> >> > >
>> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
>> > > > > lhofhansl@yahoo.com>
>> > > > > >> >> > wrote:
>> > > > > >> >> > >
>> > > > > >> >> > > > CC'ing dev list...
>> > > > > >> >> > > >
>> > > > > >> >> > > > Is anybody aware of any changes that went in recently
>> > that
>> > > > > could
>> > > > > >> >> cause
>> > > > > >> >> > > > this?
>> > > > > >> >> > > > I looked around a bit, but could not find anything
>> > obvious.
>> > > > > >> >> > > >
>> > > > > >> >> > > > -- Lars
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > > ________________________________
>> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
>> > > > > >> >> > > > To: user <us...@hbase.apache.org>
>> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
>> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on
>> .94.2
>> > > > > >> >> > > >
>> > > > > >> >> > > > I think the cluster is ok without running hbck, as
>> > > restarting
>> > > > > the
>> > > > > >> >> > > > regionserver process stops the warnings and everything
>> > > looks
>> > > > ok
>> > > > > >> >> > > otherwise.
>> > > > > >> >> > > >
>> > > > > >> >> > > > here's the regionserver right after the split happens:
>> > > > > >> >> > > > ------------------------
>> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
>> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
>> > > transition
>> > > > > node
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
>> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
>> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > > > transitioned
>> > > > > >> node
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
>> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> > > Still
>> > > > > >> waiting
>> > > > > >> >> on
>> > > > > >> >> > > the
>> > > > > >> >> > > > master to process the split for
>> > > > > bc62a8a72124a4ba3f6b9f302587903c
>> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
>> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
>> > > transition
>> > > > > node
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
>> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
>> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
>> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > > > transitioned
>> > > > > >> node
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
>> > > > > >> >> > > > -----------------------------
>> > > > > >> >> > > >
>> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
>> > > > > >> RS_ZK_REGION_SPLIT"
>> > > > > >> >> > > > continues for 15 or so hours and finally settles
>> without
>> > > > manual
>> > > > > >> >> > > > intervention with these regionserver log messages:
>> > > > > >> >> > > > -----------------------
>> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
>> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
>> > > transition
>> > > > > >> node *
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
>> RS_ZK_REGION_SPLIT
>> > > to
>> > > > > >> >> > > > RS_ZK_REGION_SPLIT
>> > > > > >> >> > > >
>> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
>> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
>> > > > transitioned
>> > > > > >> node *
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
>> RS_ZK_REGION_SPLIT
>> > > to
>> > > > > >> >> > > > RS_ZK_REGION_SPLIT
>> > > > > >> >> > > >
>> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
>> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
>> Region
>> > > > > split,
>> > > > > >> META
>> > > > > >> >> > > > updated, and report to master.
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >> >
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
>> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >> >
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >> >
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
>> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
>> > > > > >> >> > > >
>> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
>> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
>> Region
>> > > > > split,
>> > > > > >> META
>> > > > > >> >> > > > updated, and report to master.
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >> >
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
>> > > > > >> >> > > > new regions:
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >> >
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
>> > > > > >> >> > > >
>> > > > > >> >> >
>> > > > > >>
>> > > >
>> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
>> > > 14mins,
>> > > > > 2sec
>> > > > > >> >> > > > ----------------------
>> > > > > >> >> > > >
>> > > > > >> >> > > > And the master finally logs this:
>> > > > > >> >> > > > -----------------------
>> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
>> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> Region
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
>> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
>> > > > > processing
>> > > > > >> >> > > >
>> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
>> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > Received
>> > > > > SPLIT
>> > > > > >> for
>> > > > > >> >> > > region
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
>> > > doesn't
>> > > > > exist
>> > > > > >> >> > > anymore,
>> > > > > >> >> > > > probably already processed its split
>> > > > > >> >> > > > ------------------------
>> > > > > >> >> > > >
>> > > > > >> >> > > > I can't find any evidence that the region left the
>> node
>> > > > during
>> > > > > >> that
>> > > > > >> >> > time.
>> > > > > >> >> > > > I'll have to catch it in action next time and see what
>> > the
>> > > > > region
>> > > > > >> is
>> > > > > >> >> up
>> > > > > >> >> > > to
>> > > > > >> >> > > > during the problem period.
>> > > > > >> >> > > >
>> > > > > >> >> > > > What does it mean that it successfully transitioned
>> from
>> > > > SPLIT
>> > > > > to
>> > > > > >> >> > SPLIT?
>> > > > > >> >> > > > Is that a valid transition?
>> > > > > >> >> > > >
>> > > > > >> >> > > > Matt
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
>> > > yuzhihong@gmail.com>
>> > > > > >> wrote:
>> > > > > >> >> > > >
>> > > > > >> >> > > > > Matt:
>> > > > > >> >> > > > > From the following we can see that region
>> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
>> > > > > >> >> > > > > cannot be found:
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
>> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > > Handling
>> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
>> > > > > >> >> > > > >
>> server=HadoopNode162.hotpads.srv,60020,1351788248279,
>> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
>> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
>> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > Region
>> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
>> server
>> > > > > >> >> > > > >
>> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
>> > > > > >> >> > > > > processing
>> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
>> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
>> > > Received
>> > > > > SPLIT
>> > > > > >> >> for
>> > > > > >> >> > > > region
>> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
>> > > > doesn't
>> > > > > >> exist
>> > > > > >> >> > > > anymore,
>> > > > > >> >> > > > > probably already processed its split
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > Thanks
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
>> > > > > >> mcorgan@hotpads.com>
>> > > > > >> >> > > wrote:
>> > > > > >> >> > > > >
>> > > > > >> >> > > > > > Here's a sample of the master's logs from
>> yesterday.
>> > > >  It's
>> > > > > not
>> > > > > >> >> > > > correlated
>> > > > > >> >> > > > > > exactly with the other pastebin log, but there's
>> 3GB
>> > of
>> > > > > this
>> > > > > >> from
>> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > I'm am pushing the cluster a bit with importing
>> data
>> > so
>> > > > > >> testing
>> > > > > >> >> the
>> > > > > >> >> > > > split
>> > > > > >> >> > > > > > code harder than normal.  The regions are 500-1GB
>> > > > gzipped.
>> > > > >  I
>> > > > > >> can
>> > > > > >> >> > > look
>> > > > > >> >> > > > > into
>> > > > > >> >> > > > > > it more but trying to figure out what to look for.
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > Thanks Ted,
>> > > > > >> >> > > > > > Matt
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
>> > > > > yuzhihong@gmail.com>
>> > > > > >> >> > wrote:
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > > > > Matt:
>> > > > > >> >> > > > > > > This is the method which made the logging:
>> > > > > >> >> > > > > > >   private static int
>> > tickleNodeSplit(ZooKeeperWatcher
>> > > > > zkw,
>> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
>> > HRegionInfo
>> > > b,
>> > > > > >> >> > ServerName
>> > > > > >> >> > > > > > > serverName,
>> > > > > >> >> > > > > > >       final int znodeVersion)
>> > > > > >> >> > > > > > >   throws KeeperException, IOException {
>> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
>> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
>> > > > > serverName,
>> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
>> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
>> > > > > >> >> > > > > > >       znodeVersion, payload);
>> > > > > >> >> > > > > > >   }
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when
>> > > waiting
>> > > > > for
>> > > > > >> >> > master
>> > > > > >> >> > > to
>> > > > > >> >> > > > > > split
>> > > > > >> >> > > > > > > the region. Obviously something caused the
>> master
>> > not
>> > > > > able
>> > > > > >> to
>> > > > > >> >> > > split.
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > > > How large is the region ?
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > > > Can you pastebin master log for that period of
>> > time ?
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > > > Thanks
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
>> > > > > >> >> mcorgan@hotpads.com
>> > > > > >> >> > >
>> > > > > >> >> > > > > wrote:
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and
>> > have
>> > > > > >> started to
>> > > > > >> >> > > > > encounter
>> > > > > >> >> > > > > > > > infinite loops of region-transition on splits.
>> > I'm
>> > > > not
>> > > > > >> sure
>> > > > > >> >> > yet
>> > > > > >> >> > > if
>> > > > > >> >> > > > > > it's
>> > > > > >> >> > > > > > > > all splits nor if it's related to load.
>> Solution
>> > > so
>> > > > > far
>> > > > > >> has
>> > > > > >> >> > been
>> > > > > >> >> > > > to
>> > > > > >> >> > > > > > > > restart the regionserver process.
>> > > > > >> >> > > > > > > >
>> > > > > >> >> > > > > > > > log snippet:
>> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
>> > > > > >> >> > > > > > > >
>> > > > > >> >> > > > > > > > It's repeating these two lines:
>> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
>> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
>> Attempting
>> > to
>> > > > > >> transition
>> > > > > >> >> > > node
>> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
>> > > > > RS_ZK_REGION_SPLIT
>> > > > > >> to
>> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
>> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
>> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
>> Successfully
>> > > > > >> >> transitioned
>> > > > > >> >> > > node
>> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
>> > > > > RS_ZK_REGION_SPLIT
>> > > > > >> to
>> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
>> > > > > >> >> > > > > > > >
>> > > > > >> >> > > > > > > > with the occasional:
>> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
>> > > > > >> >> > > > > > > >
>> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> > > > > >> Still
>> > > > > >> >> > > > waiting
>> > > > > >> >> > > > > on
>> > > > > >> >> > > > > > > the
>> > > > > >> >> > > > > > > > master to process the split for
>> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
>> > > > > >> >> > > > > > > >
>> > > > > >> >> > > > > > > > Should the region transition from
>> > > RS_ZK_REGION_SPLIT
>> > > > to
>> > > > > >> >> itself?
>> > > > > >> >> > >  It
>> > > > > >> >> > > > > > looks
>> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the code at
>> all.
>> > > > > >> >> > > > > > > >
>> > > > > >> >> > > > > > > > Thanks,
>> > > > > >> >> > > > > > > > Matt
>> > > > > >> >> > > > > > > >
>> > > > > >> >> > > > > > >
>> > > > > >> >> > > > > >
>> > > > > >> >> > > > >
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >> >
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by ramkrishna vasudevan <ra...@gmail.com>.

Could you attach the master logs at this time
2012-11-05 00:24:55?

Regards
Ram

On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lh...@yahoo.com> wrote:

> Took a brief look through all SPLIT related commits since 0.94.0... Found
> these:
>
> HBASE-6854 *
> HBASE-6713
> HBASE-6329 *
>
> HBASE-6088
>
> HBASE-5986
> HBASE-6070 *
>
>
> The ones marked with * are (IMHO) more likely to be related.
>
> -- Lars
>
> ________________________________
> From: Matt Corgan <mc...@hotpads.com>
> To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com>
> Sent: Monday, November 5, 2012 9:28 PM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> Yeah - we were running .94.0 since it came out but never saw it there.
> I'll keep trying to narrow it down.  The only harm it's causing is log
> spam and failing to move daughters to a new regionserver, which are
> definitely problems, but it's not bringing down the cluster.
>
>
> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com> wrote:
>
> > So it seems you can repeat this to some extend in 0.94.2, but you have
> > never seen this before?
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Matt Corgan <mc...@hotpads.com>
> > To: dev <de...@hbase.apache.org>
> > Sent: Monday, November 5, 2012 9:10 PM
> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >
> > It happened in this new table that I have all the logs for.  The region
> in
> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first
> trace
> > of this region is on the regionserver with the "Intantiated tableName..."
> > message
> >
> > 2012-11-05 22:24:21,162 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Instantiated
> >
> >
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> > 9784c21cbc2.
> >
> > I also know this region came from a recent split, but I can't find any
> log
> > messages show the parent finishing the split that created this daughter
> > region.  So my guess now is that the split is actually finishing and
> > letting clients continue to write data, but something is failing to print
> > the log line and correctly tell the master about the new region.
> >
> > I've noticed that these regions are showing up on the clients in calls to
> > HTable.getRegionLocations(), so the clients continue to function, but if
> I
> > call HBaseAdmin.move() i get an UnknownRegionException.
> >
> >
> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > I think what Matt encountered was this:
> > > https://issues.apache.org/jira/browse/HBASE-7101
> > >
> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > >
> > > > Beats me - i can't find any record of it before that.
> > > >
> > > > I'm importing data into another table now.  I disabled/enabled the
> > table
> > > > first to make sure we have the original 4 region locations logged
> > > > everywhere.  Will report back...
> > > >
> > > >
> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> > jdcryans@apache.org
> > > > >wrote:
> > > >
> > > > > Wow, and can you figure how this happened?
> > > > >
> > > > > 2012-11-05 00:24:54,538 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > > > >
> > > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > > > >
> > > > > Split? Else how did the master told that RS to open a region that
> > > doesn't
> > > > > exist?
> > > > >
> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com>
> > > wrote:
> > > > > > Here's one from last night with the master logs at the bottom.
> > > > > > http://pastebin.com/cSdMbA2a
> > > > > >
> > > > > > I don't see that region in the master logs for 4 days back.  I'm
> > > going
> > > > to
> > > > > > import some new data from scratch soon and will be sure to keep
> all
> > > the
> > > > > > master logs.
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > > > jdcryans@apache.org
> > > > > >wrote:
> > > > > >
> > > > > >> This reminds me a lot of
> > > > > https://issues.apache.org/jira/browse/HBASE-4792
> > > > > >>
> > > > > >> What I'd like to see is the log from the first time the master
> > > > > >> receives the split message. I guess it says the region doesn't
> > exist
> > > > > >> anymore because the split was processed already in the master
> but
> > > > > >> there's a failure mode similar to 4792.
> > > > > >>
> > > > > >> I saw this on another cluster last week but the logs were
> rolling
> > > > > >> based on size so the original split message was lost.
> > > > > >>
> > > > > >> J-D
> > > > > >>
> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
> mcorgan@hotpads.com>
> > > > > wrote:
> > > > > >> > I've had a few successful splits today without any errors.
> I'll
> > > > turn
> > > > > up
> > > > > >> > the importer speed tomorrow to start stressing the cluster
> more.
> > > > > >> >
> > > > > >> >
> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com>
> > > wrote:
> > > > > >> >
> > > > > >> >> Matt:
> > > > > >> >> Was there any region server crash before you noticed the
> > repeated
> > > > log
> > > > > >> for
> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > > > > >> >>
> > > > > >> >> I wonder if the following JIRA might be related:
> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
> > > > daughter
> > > > > >> >>
> > > > > >> >> Thanks
> > > > > >> >>
> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> > mcorgan@hotpads.com
> > > >
> > > > > >> wrote:
> > > > > >> >>
> > > > > >> >> > Strangely, I don't see any record of that region in the
> > master
> > > > > before
> > > > > >> >> what
> > > > > >> >> > I already pasted even though I have logs back to 10/30.
> Next
> > > > time
> > > > > it
> > > > > >> >> > happens I'll gather a full log record and try to debug
> while
> > > it's
> > > > > >> >> > occurring.
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > > > > >> >> >
> > > > > >> >> > > Hi Matt,
> > > > > >> >> > > can you paste some more master logs of region
> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > > > > >> >> > > I think Its not problem with splitting.
> > > > > >> >> > > We are getting
> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not found on
> > > server
> > > > "
> > > > > +
> > > > > >> >> > > serverName +
> > > > > >> >> > >         "; failed processing");
> > > > > >> >> > > this log means no entry in servers map(not fully
> assigned).
> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > > > > >> >> > >     HRegionInfo foundHri = null;
> > > > > >> >> > >     for (HRegionInfo hri: hris) {
> > > > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> > > > > >> >> > >         foundHri = hri;
> > > > > >> >> > >         break;
> > > > > >> >> > >       }
> > > > > >> >> > >     }
> > > > > >> >> > >     return foundHri;
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > > > > lhofhansl@yahoo.com>
> > > > > >> >> > wrote:
> > > > > >> >> > >
> > > > > >> >> > > > CC'ing dev list...
> > > > > >> >> > > >
> > > > > >> >> > > > Is anybody aware of any changes that went in recently
> > that
> > > > > could
> > > > > >> >> cause
> > > > > >> >> > > > this?
> > > > > >> >> > > > I looked around a bit, but could not find anything
> > obvious.
> > > > > >> >> > > >
> > > > > >> >> > > > -- Lars
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > > ________________________________
> > > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > > > > >> >> > > > To: user <us...@hbase.apache.org>
> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on
> .94.2
> > > > > >> >> > > >
> > > > > >> >> > > > I think the cluster is ok without running hbck, as
> > > restarting
> > > > > the
> > > > > >> >> > > > regionserver process stops the warnings and everything
> > > looks
> > > > ok
> > > > > >> >> > > otherwise.
> > > > > >> >> > > >
> > > > > >> >> > > > here's the regionserver right after the split happens:
> > > > > >> >> > > > ------------------------
> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > > transition
> > > > > node
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > > transitioned
> > > > > >> node
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > > Still
> > > > > >> waiting
> > > > > >> >> on
> > > > > >> >> > > the
> > > > > >> >> > > > master to process the split for
> > > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > > transition
> > > > > node
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > > transitioned
> > > > > >> node
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > > >> >> > > > -----------------------------
> > > > > >> >> > > >
> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > > > > >> RS_ZK_REGION_SPLIT"
> > > > > >> >> > > > continues for 15 or so hours and finally settles
> without
> > > > manual
> > > > > >> >> > > > intervention with these regionserver log messages:
> > > > > >> >> > > > -----------------------
> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > > transition
> > > > > >> node *
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> RS_ZK_REGION_SPLIT
> > > to
> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> > > > > >> >> > > >
> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > > transitioned
> > > > > >> node *
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> RS_ZK_REGION_SPLIT
> > > to
> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> > > > > >> >> > > >
> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> Region
> > > > > split,
> > > > > >> META
> > > > > >> >> > > > updated, and report to master.
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > > > > >> >> > > >
> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> Region
> > > > > split,
> > > > > >> META
> > > > > >> >> > > > updated, and report to master.
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > > > >> >> > > > new regions:
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > > > > >> >> > > >
> > > > > >> >> >
> > > > > >>
> > > >
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
> > > 14mins,
> > > > > 2sec
> > > > > >> >> > > > ----------------------
> > > > > >> >> > > >
> > > > > >> >> > > > And the master finally logs this:
> > > > > >> >> > > > -----------------------
> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Region
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> > > > > processing
> > > > > >> >> > > >
> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > Received
> > > > > SPLIT
> > > > > >> for
> > > > > >> >> > > region
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > > doesn't
> > > > > exist
> > > > > >> >> > > anymore,
> > > > > >> >> > > > probably already processed its split
> > > > > >> >> > > > ------------------------
> > > > > >> >> > > >
> > > > > >> >> > > > I can't find any evidence that the region left the node
> > > > during
> > > > > >> that
> > > > > >> >> > time.
> > > > > >> >> > > > I'll have to catch it in action next time and see what
> > the
> > > > > region
> > > > > >> is
> > > > > >> >> up
> > > > > >> >> > > to
> > > > > >> >> > > > during the problem period.
> > > > > >> >> > > >
> > > > > >> >> > > > What does it mean that it successfully transitioned
> from
> > > > SPLIT
> > > > > to
> > > > > >> >> > SPLIT?
> > > > > >> >> > > > Is that a valid transition?
> > > > > >> >> > > >
> > > > > >> >> > > > Matt
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> > > yuzhihong@gmail.com>
> > > > > >> wrote:
> > > > > >> >> > > >
> > > > > >> >> > > > > Matt:
> > > > > >> >> > > > > From the following we can see that region
> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > > >> >> > > > > cannot be found:
> > > > > >> >> > > > >
> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > > Handling
> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > > > > >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > Region
> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > > > >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > > > >> >> > > > > processing
> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > > Received
> > > > > SPLIT
> > > > > >> >> for
> > > > > >> >> > > > region
> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > > > doesn't
> > > > > >> exist
> > > > > >> >> > > > anymore,
> > > > > >> >> > > > > probably already processed its split
> > > > > >> >> > > > >
> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
> > > > > >> >> > > > >
> > > > > >> >> > > > > Thanks
> > > > > >> >> > > > >
> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > > > > >> mcorgan@hotpads.com>
> > > > > >> >> > > wrote:
> > > > > >> >> > > > >
> > > > > >> >> > > > > > Here's a sample of the master's logs from
> yesterday.
> > > >  It's
> > > > > not
> > > > > >> >> > > > correlated
> > > > > >> >> > > > > > exactly with the other pastebin log, but there's
> 3GB
> > of
> > > > > this
> > > > > >> from
> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > I'm am pushing the cluster a bit with importing
> data
> > so
> > > > > >> testing
> > > > > >> >> the
> > > > > >> >> > > > split
> > > > > >> >> > > > > > code harder than normal.  The regions are 500-1GB
> > > > gzipped.
> > > > >  I
> > > > > >> can
> > > > > >> >> > > look
> > > > > >> >> > > > > into
> > > > > >> >> > > > > > it more but trying to figure out what to look for.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > Thanks Ted,
> > > > > >> >> > > > > > Matt
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > > > > yuzhihong@gmail.com>
> > > > > >> >> > wrote:
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > > Matt:
> > > > > >> >> > > > > > > This is the method which made the logging:
> > > > > >> >> > > > > > >   private static int
> > tickleNodeSplit(ZooKeeperWatcher
> > > > > zkw,
> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> > HRegionInfo
> > > b,
> > > > > >> >> > ServerName
> > > > > >> >> > > > > > > serverName,
> > > > > >> >> > > > > > >       final int znodeVersion)
> > > > > >> >> > > > > > >   throws KeeperException, IOException {
> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> > > > > serverName,
> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > > > > >> >> > > > > > >       znodeVersion, payload);
> > > > > >> >> > > > > > >   }
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when
> > > waiting
> > > > > for
> > > > > >> >> > master
> > > > > >> >> > > to
> > > > > >> >> > > > > > split
> > > > > >> >> > > > > > > the region. Obviously something caused the master
> > not
> > > > > able
> > > > > >> to
> > > > > >> >> > > split.
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > How large is the region ?
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > Can you pastebin master log for that period of
> > time ?
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > Thanks
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> > > > > >> >> mcorgan@hotpads.com
> > > > > >> >> > >
> > > > > >> >> > > > > wrote:
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and
> > have
> > > > > >> started to
> > > > > >> >> > > > > encounter
> > > > > >> >> > > > > > > > infinite loops of region-transition on splits.
> > I'm
> > > > not
> > > > > >> sure
> > > > > >> >> > yet
> > > > > >> >> > > if
> > > > > >> >> > > > > > it's
> > > > > >> >> > > > > > > > all splits nor if it's related to load.
> Solution
> > > so
> > > > > far
> > > > > >> has
> > > > > >> >> > been
> > > > > >> >> > > > to
> > > > > >> >> > > > > > > > restart the regionserver process.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > log snippet:
> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > It's repeating these two lines:
> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting
> > to
> > > > > >> transition
> > > > > >> >> > > node
> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > > > RS_ZK_REGION_SPLIT
> > > > > >> to
> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> Successfully
> > > > > >> >> transitioned
> > > > > >> >> > > node
> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > > > RS_ZK_REGION_SPLIT
> > > > > >> to
> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > with the occasional:
> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > >> >> > > > > > > >
> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > > > > >> Still
> > > > > >> >> > > > waiting
> > > > > >> >> > > > > on
> > > > > >> >> > > > > > > the
> > > > > >> >> > > > > > > > master to process the split for
> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Should the region transition from
> > > RS_ZK_REGION_SPLIT
> > > > to
> > > > > >> >> itself?
> > > > > >> >> > >  It
> > > > > >> >> > > > > > looks
> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the code at
> all.
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Thanks,
> > > > > >> >> > > > > > > > Matt
> > > > > >> >> > > > > > > >
> > > > > >> >> > > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by lars hofhansl <lh...@yahoo.com>.

Took a brief look through all SPLIT related commits since 0.94.0... Found these:

HBASE-6854 *
HBASE-6713
HBASE-6329 * 

HBASE-6088

HBASE-5986
HBASE-6070 *


The ones marked with * are (IMHO) more likely to be related.

-- Lars

________________________________
From: Matt Corgan <mc...@hotpads.com>
To: dev <de...@hbase.apache.org>; lars hofhansl <lh...@yahoo.com> 
Sent: Monday, November 5, 2012 9:28 PM
Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Yeah - we were running .94.0 since it came out but never saw it there.
I'll keep trying to narrow it down.  The only harm it's causing is log
spam and failing to move daughters to a new regionserver, which are
definitely problems, but it's not bringing down the cluster.


On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com> wrote:

> So it seems you can repeat this to some extend in 0.94.2, but you have
> never seen this before?
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Matt Corgan <mc...@hotpads.com>
> To: dev <de...@hbase.apache.org>
> Sent: Monday, November 5, 2012 9:10 PM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> It happened in this new table that I have all the logs for.  The region in
> question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace
> of this region is on the regionserver with the "Intantiated tableName..."
> message
>
> 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Instantiated
>
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> 9784c21cbc2.
>
> I also know this region came from a recent split, but I can't find any log
> messages show the parent finishing the split that created this daughter
> region.  So my guess now is that the split is actually finishing and
> letting clients continue to write data, but something is failing to print
> the log line and correctly tell the master about the new region.
>
> I've noticed that these regions are showing up on the clients in calls to
> HTable.getRegionLocations(), so the clients continue to function, but if I
> call HBaseAdmin.move() i get an UnknownRegionException.
>
>
> On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > I think what Matt encountered was this:
> > https://issues.apache.org/jira/browse/HBASE-7101
> >
> > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >
> > > Beats me - i can't find any record of it before that.
> > >
> > > I'm importing data into another table now.  I disabled/enabled the
> table
> > > first to make sure we have the original 4 region locations logged
> > > everywhere.  Will report back...
> > >
> > >
> > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
> > > >wrote:
> > >
> > > > Wow, and can you figure how this happened?
> > > >
> > > > 2012-11-05 00:24:54,538 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > > >
> > > > Split? Else how did the master told that RS to open a region that
> > doesn't
> > > > exist?
> > > >
> > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > > > Here's one from last night with the master logs at the bottom.
> > > > > http://pastebin.com/cSdMbA2a
> > > > >
> > > > > I don't see that region in the master logs for 4 days back.  I'm
> > going
> > > to
> > > > > import some new data from scratch soon and will be sure to keep all
> > the
> > > > > master logs.
> > > > >
> > > > >
> > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > > jdcryans@apache.org
> > > > >wrote:
> > > > >
> > > > >> This reminds me a lot of
> > > > https://issues.apache.org/jira/browse/HBASE-4792
> > > > >>
> > > > >> What I'd like to see is the log from the first time the master
> > > > >> receives the split message. I guess it says the region doesn't
> exist
> > > > >> anymore because the split was processed already in the master but
> > > > >> there's a failure mode similar to 4792.
> > > > >>
> > > > >> I saw this on another cluster last week but the logs were rolling
> > > > >> based on size so the original split message was lost.
> > > > >>
> > > > >> J-D
> > > > >>
> > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com>
> > > > wrote:
> > > > >> > I've had a few successful splits today without any errors.  I'll
> > > turn
> > > > up
> > > > >> > the importer speed tomorrow to start stressing the cluster more.
> > > > >> >
> > > > >> >
> > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > >> >
> > > > >> >> Matt:
> > > > >> >> Was there any region server crash before you noticed the
> repeated
> > > log
> > > > >> for
> > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > > > >> >>
> > > > >> >> I wonder if the following JIRA might be related:
> > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
> > > daughter
> > > > >> >>
> > > > >> >> Thanks
> > > > >> >>
> > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> mcorgan@hotpads.com
> > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >> > Strangely, I don't see any record of that region in the
> master
> > > > before
> > > > >> >> what
> > > > >> >> > I already pasted even though I have logs back to 10/30.  Next
> > > time
> > > > it
> > > > >> >> > happens I'll gather a full log record and try to debug while
> > it's
> > > > >> >> > occurring.
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > > > >> >> >
> > > > >> >> > > Hi Matt,
> > > > >> >> > > can you paste some more master logs of region
> > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > > > >> >> > > I think Its not problem with splitting.
> > > > >> >> > > We are getting
> > > > >> >> > >       LOG.warn("Region " + encodedName + " not found on
> > server
> > > "
> > > > +
> > > > >> >> > > serverName +
> > > > >> >> > >         "; failed processing");
> > > > >> >> > > this log means no entry in servers map(not fully assigned).
> > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > > > >> >> > >     HRegionInfo foundHri = null;
> > > > >> >> > >     for (HRegionInfo hri: hris) {
> > > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> > > > >> >> > >         foundHri = hri;
> > > > >> >> > >         break;
> > > > >> >> > >       }
> > > > >> >> > >     }
> > > > >> >> > >     return foundHri;
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > > > lhofhansl@yahoo.com>
> > > > >> >> > wrote:
> > > > >> >> > >
> > > > >> >> > > > CC'ing dev list...
> > > > >> >> > > >
> > > > >> >> > > > Is anybody aware of any changes that went in recently
> that
> > > > could
> > > > >> >> cause
> > > > >> >> > > > this?
> > > > >> >> > > > I looked around a bit, but could not find anything
> obvious.
> > > > >> >> > > >
> > > > >> >> > > > -- Lars
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > ________________________________
> > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > > > >> >> > > > To: user <us...@hbase.apache.org>
> > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > > > >> >> > > >
> > > > >> >> > > > I think the cluster is ok without running hbck, as
> > restarting
> > > > the
> > > > >> >> > > > regionserver process stops the warnings and everything
> > looks
> > > ok
> > > > >> >> > > otherwise.
> > > > >> >> > > >
> > > > >> >> > > > here's the regionserver right after the split happens:
> > > > >> >> > > > ------------------------
> > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > transition
> > > > node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > transitioned
> > > > >> node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > Still
> > > > >> waiting
> > > > >> >> on
> > > > >> >> > > the
> > > > >> >> > > > master to process the split for
> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > transition
> > > > node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > transitioned
> > > > >> node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > >> >> > > > -----------------------------
> > > > >> >> > > >
> > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > > > >> RS_ZK_REGION_SPLIT"
> > > > >> >> > > > continues for 15 or so hours and finally settles without
> > > manual
> > > > >> >> > > > intervention with these regionserver log messages:
> > > > >> >> > > > -----------------------
> > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > transition
> > > > >> node *
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> > to
> > > > >> >> > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > transitioned
> > > > >> node *
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> > to
> > > > >> >> > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > > split,
> > > > >> META
> > > > >> >> > > > updated, and report to master.
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > > split,
> > > > >> META
> > > > >> >> > > > updated, and report to master.
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > > >> >> > > > new regions:
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > > > >> >> > > >
> > > > >> >> >
> > > > >>
> > >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
> > 14mins,
> > > > 2sec
> > > > >> >> > > > ----------------------
> > > > >> >> > > >
> > > > >> >> > > > And the master finally logs this:
> > > > >> >> > > > -----------------------
> > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> > > > processing
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Received
> > > > SPLIT
> > > > >> for
> > > > >> >> > > region
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > doesn't
> > > > exist
> > > > >> >> > > anymore,
> > > > >> >> > > > probably already processed its split
> > > > >> >> > > > ------------------------
> > > > >> >> > > >
> > > > >> >> > > > I can't find any evidence that the region left the node
> > > during
> > > > >> that
> > > > >> >> > time.
> > > > >> >> > > > I'll have to catch it in action next time and see what
> the
> > > > region
> > > > >> is
> > > > >> >> up
> > > > >> >> > > to
> > > > >> >> > > > during the problem period.
> > > > >> >> > > >
> > > > >> >> > > > What does it mean that it successfully transitioned from
> > > SPLIT
> > > > to
> > > > >> >> > SPLIT?
> > > > >> >> > > > Is that a valid transition?
> > > > >> >> > > >
> > > > >> >> > > > Matt
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > > > >> wrote:
> > > > >> >> > > >
> > > > >> >> > > > > Matt:
> > > > >> >> > > > > From the following we can see that region
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > >> >> > > > > cannot be found:
> > > > >> >> > > > >
> > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > Handling
> > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > > > >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Region
> > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > > >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > > >> >> > > > > processing
> > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > Received
> > > > SPLIT
> > > > >> >> for
> > > > >> >> > > > region
> > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > > doesn't
> > > > >> exist
> > > > >> >> > > > anymore,
> > > > >> >> > > > > probably already processed its split
> > > > >> >> > > > >
> > > > >> >> > > > > Have you run hbck to repair your cluster ?
> > > > >> >> > > > >
> > > > >> >> > > > > Thanks
> > > > >> >> > > > >
> > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > > > >> mcorgan@hotpads.com>
> > > > >> >> > > wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Here's a sample of the master's logs from yesterday.
> > >  It's
> > > > not
> > > > >> >> > > > correlated
> > > > >> >> > > > > > exactly with the other pastebin log, but there's 3GB
> of
> > > > this
> > > > >> from
> > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > > >> >> > > > > >
> > > > >> >> > > > > > I'm am pushing the cluster a bit with importing data
> so
> > > > >> testing
> > > > >> >> the
> > > > >> >> > > > split
> > > > >> >> > > > > > code harder than normal.  The regions are 500-1GB
> > > gzipped.
> > > >  I
> > > > >> can
> > > > >> >> > > look
> > > > >> >> > > > > into
> > > > >> >> > > > > > it more but trying to figure out what to look for.
> > > > >> >> > > > > >
> > > > >> >> > > > > > Thanks Ted,
> > > > >> >> > > > > > Matt
> > > > >> >> > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > > > yuzhihong@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> > > > > >
> > > > >> >> > > > > > > Matt:
> > > > >> >> > > > > > > This is the method which made the logging:
> > > > >> >> > > > > > >   private static int
> tickleNodeSplit(ZooKeeperWatcher
> > > > zkw,
> > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> HRegionInfo
> > b,
> > > > >> >> > ServerName
> > > > >> >> > > > > > > serverName,
> > > > >> >> > > > > > >       final int znodeVersion)
> > > > >> >> > > > > > >   throws KeeperException, IOException {
> > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> > > > serverName,
> > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > > > >> >> > > > > > >       znodeVersion, payload);
> > > > >> >> > > > > > >   }
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when
> > waiting
> > > > for
> > > > >> >> > master
> > > > >> >> > > to
> > > > >> >> > > > > > split
> > > > >> >> > > > > > > the region. Obviously something caused the master
> not
> > > > able
> > > > >> to
> > > > >> >> > > split.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > How large is the region ?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Can you pastebin master log for that period of
> time ?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Thanks
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> > > > >> >> mcorgan@hotpads.com
> > > > >> >> > >
> > > > >> >> > > > > wrote:
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and
> have
> > > > >> started to
> > > > >> >> > > > > encounter
> > > > >> >> > > > > > > > infinite loops of region-transition on splits.
> I'm
> > > not
> > > > >> sure
> > > > >> >> > yet
> > > > >> >> > > if
> > > > >> >> > > > > > it's
> > > > >> >> > > > > > > > all splits nor if it's related to load.  Solution
> > so
> > > > far
> > > > >> has
> > > > >> >> > been
> > > > >> >> > > > to
> > > > >> >> > > > > > > > restart the regionserver process.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > log snippet:
> > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > It's repeating these two lines:
> > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting
> to
> > > > >> transition
> > > > >> >> > > node
> > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > > RS_ZK_REGION_SPLIT
> > > > >> to
> > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> > > > >> >> transitioned
> > > > >> >> > > node
> > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > > RS_ZK_REGION_SPLIT
> > > > >> to
> > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > with the occasional:
> > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > >> >> > > > > > > >
> > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > > > >> Still
> > > > >> >> > > > waiting
> > > > >> >> > > > > on
> > > > >> >> > > > > > > the
> > > > >> >> > > > > > > > master to process the split for
> > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Should the region transition from
> > RS_ZK_REGION_SPLIT
> > > to
> > > > >> >> itself?
> > > > >> >> > >  It
> > > > >> >> > > > > > looks
> > > > >> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Thanks,
> > > > >> >> > > > > > > > Matt
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

Yeah - we were running .94.0 since it came out but never saw it there.
 I'll keep trying to narrow it down.  The only harm it's causing is log
spam and failing to move daughters to a new regionserver, which are
definitely problems, but it's not bringing down the cluster.


On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lh...@yahoo.com> wrote:

> So it seems you can repeat this to some extend in 0.94.2, but you have
> never seen this before?
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Matt Corgan <mc...@hotpads.com>
> To: dev <de...@hbase.apache.org>
> Sent: Monday, November 5, 2012 9:10 PM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> It happened in this new table that I have all the logs for.  The region in
> question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace
> of this region is on the regionserver with the "Intantiated tableName..."
> message
>
> 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Instantiated
>
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> 9784c21cbc2.
>
> I also know this region came from a recent split, but I can't find any log
> messages show the parent finishing the split that created this daughter
> region.  So my guess now is that the split is actually finishing and
> letting clients continue to write data, but something is failing to print
> the log line and correctly tell the master about the new region.
>
> I've noticed that these regions are showing up on the clients in calls to
> HTable.getRegionLocations(), so the clients continue to function, but if I
> call HBaseAdmin.move() i get an UnknownRegionException.
>
>
> On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > I think what Matt encountered was this:
> > https://issues.apache.org/jira/browse/HBASE-7101
> >
> > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >
> > > Beats me - i can't find any record of it before that.
> > >
> > > I'm importing data into another table now.  I disabled/enabled the
> table
> > > first to make sure we have the original 4 region locations logged
> > > everywhere.  Will report back...
> > >
> > >
> > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
> > > >wrote:
> > >
> > > > Wow, and can you figure how this happened?
> > > >
> > > > 2012-11-05 00:24:54,538 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > > >
> > > > Split? Else how did the master told that RS to open a region that
> > doesn't
> > > > exist?
> > > >
> > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > > > Here's one from last night with the master logs at the bottom.
> > > > > http://pastebin.com/cSdMbA2a
> > > > >
> > > > > I don't see that region in the master logs for 4 days back.  I'm
> > going
> > > to
> > > > > import some new data from scratch soon and will be sure to keep all
> > the
> > > > > master logs.
> > > > >
> > > > >
> > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > > jdcryans@apache.org
> > > > >wrote:
> > > > >
> > > > >> This reminds me a lot of
> > > > https://issues.apache.org/jira/browse/HBASE-4792
> > > > >>
> > > > >> What I'd like to see is the log from the first time the master
> > > > >> receives the split message. I guess it says the region doesn't
> exist
> > > > >> anymore because the split was processed already in the master but
> > > > >> there's a failure mode similar to 4792.
> > > > >>
> > > > >> I saw this on another cluster last week but the logs were rolling
> > > > >> based on size so the original split message was lost.
> > > > >>
> > > > >> J-D
> > > > >>
> > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com>
> > > > wrote:
> > > > >> > I've had a few successful splits today without any errors.  I'll
> > > turn
> > > > up
> > > > >> > the importer speed tomorrow to start stressing the cluster more.
> > > > >> >
> > > > >> >
> > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > >> >
> > > > >> >> Matt:
> > > > >> >> Was there any region server crash before you noticed the
> repeated
> > > log
> > > > >> for
> > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > > > >> >>
> > > > >> >> I wonder if the following JIRA might be related:
> > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
> > > daughter
> > > > >> >>
> > > > >> >> Thanks
> > > > >> >>
> > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> mcorgan@hotpads.com
> > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >> > Strangely, I don't see any record of that region in the
> master
> > > > before
> > > > >> >> what
> > > > >> >> > I already pasted even though I have logs back to 10/30.  Next
> > > time
> > > > it
> > > > >> >> > happens I'll gather a full log record and try to debug while
> > it's
> > > > >> >> > occurring.
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > > > >> >> >
> > > > >> >> > > Hi Matt,
> > > > >> >> > > can you paste some more master logs of region
> > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > > > >> >> > > I think Its not problem with splitting.
> > > > >> >> > > We are getting
> > > > >> >> > >       LOG.warn("Region " + encodedName + " not found on
> > server
> > > "
> > > > +
> > > > >> >> > > serverName +
> > > > >> >> > >         "; failed processing");
> > > > >> >> > > this log means no entry in servers map(not fully assigned).
> > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > > > >> >> > >     HRegionInfo foundHri = null;
> > > > >> >> > >     for (HRegionInfo hri: hris) {
> > > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> > > > >> >> > >         foundHri = hri;
> > > > >> >> > >         break;
> > > > >> >> > >       }
> > > > >> >> > >     }
> > > > >> >> > >     return foundHri;
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > > > lhofhansl@yahoo.com>
> > > > >> >> > wrote:
> > > > >> >> > >
> > > > >> >> > > > CC'ing dev list...
> > > > >> >> > > >
> > > > >> >> > > > Is anybody aware of any changes that went in recently
> that
> > > > could
> > > > >> >> cause
> > > > >> >> > > > this?
> > > > >> >> > > > I looked around a bit, but could not find anything
> obvious.
> > > > >> >> > > >
> > > > >> >> > > > -- Lars
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > ________________________________
> > > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > > > >> >> > > > To: user <us...@hbase.apache.org>
> > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > > > >> >> > > >
> > > > >> >> > > > I think the cluster is ok without running hbck, as
> > restarting
> > > > the
> > > > >> >> > > > regionserver process stops the warnings and everything
> > looks
> > > ok
> > > > >> >> > > otherwise.
> > > > >> >> > > >
> > > > >> >> > > > here's the regionserver right after the split happens:
> > > > >> >> > > > ------------------------
> > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > transition
> > > > node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > transitioned
> > > > >> node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > Still
> > > > >> waiting
> > > > >> >> on
> > > > >> >> > > the
> > > > >> >> > > > master to process the split for
> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > transition
> > > > node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > transitioned
> > > > >> node
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > >> >> > > > -----------------------------
> > > > >> >> > > >
> > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > > > >> RS_ZK_REGION_SPLIT"
> > > > >> >> > > > continues for 15 or so hours and finally settles without
> > > manual
> > > > >> >> > > > intervention with these regionserver log messages:
> > > > >> >> > > > -----------------------
> > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > transition
> > > > >> node *
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> > to
> > > > >> >> > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > > transitioned
> > > > >> node *
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> > to
> > > > >> >> > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > > split,
> > > > >> META
> > > > >> >> > > > updated, and report to master.
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > > split,
> > > > >> META
> > > > >> >> > > > updated, and report to master.
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > > >> >> > > > new regions:
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > > > >> >> > > >
> > > > >> >> >
> > > > >>
> > >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
> > 14mins,
> > > > 2sec
> > > > >> >> > > > ----------------------
> > > > >> >> > > >
> > > > >> >> > > > And the master finally logs this:
> > > > >> >> > > > -----------------------
> > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> > > > processing
> > > > >> >> > > >
> > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Received
> > > > SPLIT
> > > > >> for
> > > > >> >> > > region
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > doesn't
> > > > exist
> > > > >> >> > > anymore,
> > > > >> >> > > > probably already processed its split
> > > > >> >> > > > ------------------------
> > > > >> >> > > >
> > > > >> >> > > > I can't find any evidence that the region left the node
> > > during
> > > > >> that
> > > > >> >> > time.
> > > > >> >> > > > I'll have to catch it in action next time and see what
> the
> > > > region
> > > > >> is
> > > > >> >> up
> > > > >> >> > > to
> > > > >> >> > > > during the problem period.
> > > > >> >> > > >
> > > > >> >> > > > What does it mean that it successfully transitioned from
> > > SPLIT
> > > > to
> > > > >> >> > SPLIT?
> > > > >> >> > > > Is that a valid transition?
> > > > >> >> > > >
> > > > >> >> > > > Matt
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > > > >> wrote:
> > > > >> >> > > >
> > > > >> >> > > > > Matt:
> > > > >> >> > > > > From the following we can see that region
> > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > >> >> > > > > cannot be found:
> > > > >> >> > > > >
> > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > Handling
> > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > > > >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Region
> > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > > >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > > >> >> > > > > processing
> > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > Received
> > > > SPLIT
> > > > >> >> for
> > > > >> >> > > > region
> > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > > doesn't
> > > > >> exist
> > > > >> >> > > > anymore,
> > > > >> >> > > > > probably already processed its split
> > > > >> >> > > > >
> > > > >> >> > > > > Have you run hbck to repair your cluster ?
> > > > >> >> > > > >
> > > > >> >> > > > > Thanks
> > > > >> >> > > > >
> > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > > > >> mcorgan@hotpads.com>
> > > > >> >> > > wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Here's a sample of the master's logs from yesterday.
> > >  It's
> > > > not
> > > > >> >> > > > correlated
> > > > >> >> > > > > > exactly with the other pastebin log, but there's 3GB
> of
> > > > this
> > > > >> from
> > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > > >> >> > > > > >
> > > > >> >> > > > > > I'm am pushing the cluster a bit with importing data
> so
> > > > >> testing
> > > > >> >> the
> > > > >> >> > > > split
> > > > >> >> > > > > > code harder than normal.  The regions are 500-1GB
> > > gzipped.
> > > >  I
> > > > >> can
> > > > >> >> > > look
> > > > >> >> > > > > into
> > > > >> >> > > > > > it more but trying to figure out what to look for.
> > > > >> >> > > > > >
> > > > >> >> > > > > > Thanks Ted,
> > > > >> >> > > > > > Matt
> > > > >> >> > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > > > yuzhihong@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> > > > > >
> > > > >> >> > > > > > > Matt:
> > > > >> >> > > > > > > This is the method which made the logging:
> > > > >> >> > > > > > >   private static int
> tickleNodeSplit(ZooKeeperWatcher
> > > > zkw,
> > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> HRegionInfo
> > b,
> > > > >> >> > ServerName
> > > > >> >> > > > > > > serverName,
> > > > >> >> > > > > > >       final int znodeVersion)
> > > > >> >> > > > > > >   throws KeeperException, IOException {
> > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> > > > serverName,
> > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > > > >> >> > > > > > >       znodeVersion, payload);
> > > > >> >> > > > > > >   }
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when
> > waiting
> > > > for
> > > > >> >> > master
> > > > >> >> > > to
> > > > >> >> > > > > > split
> > > > >> >> > > > > > > the region. Obviously something caused the master
> not
> > > > able
> > > > >> to
> > > > >> >> > > split.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > How large is the region ?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Can you pastebin master log for that period of
> time ?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > Thanks
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> > > > >> >> mcorgan@hotpads.com
> > > > >> >> > >
> > > > >> >> > > > > wrote:
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and
> have
> > > > >> started to
> > > > >> >> > > > > encounter
> > > > >> >> > > > > > > > infinite loops of region-transition on splits.
> I'm
> > > not
> > > > >> sure
> > > > >> >> > yet
> > > > >> >> > > if
> > > > >> >> > > > > > it's
> > > > >> >> > > > > > > > all splits nor if it's related to load.  Solution
> > so
> > > > far
> > > > >> has
> > > > >> >> > been
> > > > >> >> > > > to
> > > > >> >> > > > > > > > restart the regionserver process.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > log snippet:
> > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > It's repeating these two lines:
> > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting
> to
> > > > >> transition
> > > > >> >> > > node
> > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > > RS_ZK_REGION_SPLIT
> > > > >> to
> > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> > > > >> >> transitioned
> > > > >> >> > > node
> > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > > RS_ZK_REGION_SPLIT
> > > > >> to
> > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > with the occasional:
> > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > >> >> > > > > > > >
> > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > > > >> Still
> > > > >> >> > > > waiting
> > > > >> >> > > > > on
> > > > >> >> > > > > > > the
> > > > >> >> > > > > > > > master to process the split for
> > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Should the region transition from
> > RS_ZK_REGION_SPLIT
> > > to
> > > > >> >> itself?
> > > > >> >> > >  It
> > > > >> >> > > > > > looks
> > > > >> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Thanks,
> > > > >> >> > > > > > > > Matt
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by lars hofhansl <lh...@yahoo.com>.

So it seems you can repeat this to some extend in 0.94.2, but you have never seen this before?


-- Lars



________________________________
 From: Matt Corgan <mc...@hotpads.com>
To: dev <de...@hbase.apache.org> 
Sent: Monday, November 5, 2012 9:10 PM
Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
 
It happened in this new table that I have all the logs for.  The region in
question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace
of this region is on the regionserver with the "Intantiated tableName..."
message

2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Instantiated
StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
9784c21cbc2.

I also know this region came from a recent split, but I can't find any log
messages show the parent finishing the split that created this daughter
region.  So my guess now is that the split is actually finishing and
letting clients continue to write data, but something is failing to print
the log line and correctly tell the master about the new region.

I've noticed that these regions are showing up on the clients in calls to
HTable.getRegionLocations(), so the clients continue to function, but if I
call HBaseAdmin.move() i get an UnknownRegionException.


On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:

> I think what Matt encountered was this:
> https://issues.apache.org/jira/browse/HBASE-7101
>
> On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > Beats me - i can't find any record of it before that.
> >
> > I'm importing data into another table now.  I disabled/enabled the table
> > first to make sure we have the original 4 region locations logged
> > everywhere.  Will report back...
> >
> >
> > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <jdcryans@apache.org
> > >wrote:
> >
> > > Wow, and can you figure how this happened?
> > >
> > > 2012-11-05 00:24:54,538 DEBUG
> > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > >
> > >
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > >
> > > Split? Else how did the master told that RS to open a region that
> doesn't
> > > exist?
> > >
> > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > > > Here's one from last night with the master logs at the bottom.
> > > > http://pastebin.com/cSdMbA2a
> > > >
> > > > I don't see that region in the master logs for 4 days back.  I'm
> going
> > to
> > > > import some new data from scratch soon and will be sure to keep all
> the
> > > > master logs.
> > > >
> > > >
> > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > jdcryans@apache.org
> > > >wrote:
> > > >
> > > >> This reminds me a lot of
> > > https://issues.apache.org/jira/browse/HBASE-4792
> > > >>
> > > >> What I'd like to see is the log from the first time the master
> > > >> receives the split message. I guess it says the region doesn't exist
> > > >> anymore because the split was processed already in the master but
> > > >> there's a failure mode similar to 4792.
> > > >>
> > > >> I saw this on another cluster last week but the logs were rolling
> > > >> based on size so the original split message was lost.
> > > >>
> > > >> J-D
> > > >>
> > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com>
> > > wrote:
> > > >> > I've had a few successful splits today without any errors.  I'll
> > turn
> > > up
> > > >> > the importer speed tomorrow to start stressing the cluster more.
> > > >> >
> > > >> >
> > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > >> >
> > > >> >> Matt:
> > > >> >> Was there any region server crash before you noticed the repeated
> > log
> > > >> for
> > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > > >> >>
> > > >> >> I wonder if the following JIRA might be related:
> > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
> > daughter
> > > >> >>
> > > >> >> Thanks
> > > >> >>
> > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mcorgan@hotpads.com
> >
> > > >> wrote:
> > > >> >>
> > > >> >> > Strangely, I don't see any record of that region in the master
> > > before
> > > >> >> what
> > > >> >> > I already pasted even though I have logs back to 10/30.  Next
> > time
> > > it
> > > >> >> > happens I'll gather a full log record and try to debug while
> it's
> > > >> >> > occurring.
> > > >> >> >
> > > >> >> >
> > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > > >> >> >
> > > >> >> > > Hi Matt,
> > > >> >> > > can you paste some more master logs of region
> > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > > >> >> > > I think Its not problem with splitting.
> > > >> >> > > We are getting
> > > >> >> > >       LOG.warn("Region " + encodedName + " not found on
> server
> > "
> > > +
> > > >> >> > > serverName +
> > > >> >> > >         "; failed processing");
> > > >> >> > > this log means no entry in servers map(not fully assigned).
> > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > > >> >> > >     HRegionInfo foundHri = null;
> > > >> >> > >     for (HRegionInfo hri: hris) {
> > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> > > >> >> > >         foundHri = hri;
> > > >> >> > >         break;
> > > >> >> > >       }
> > > >> >> > >     }
> > > >> >> > >     return foundHri;
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > > lhofhansl@yahoo.com>
> > > >> >> > wrote:
> > > >> >> > >
> > > >> >> > > > CC'ing dev list...
> > > >> >> > > >
> > > >> >> > > > Is anybody aware of any changes that went in recently that
> > > could
> > > >> >> cause
> > > >> >> > > > this?
> > > >> >> > > > I looked around a bit, but could not find anything obvious.
> > > >> >> > > >
> > > >> >> > > > -- Lars
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > ________________________________
> > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > > >> >> > > > To: user <us...@hbase.apache.org>
> > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > > >> >> > > >
> > > >> >> > > > I think the cluster is ok without running hbck, as
> restarting
> > > the
> > > >> >> > > > regionserver process stops the warnings and everything
> looks
> > ok
> > > >> >> > > otherwise.
> > > >> >> > > >
> > > >> >> > > > here's the regionserver right after the split happens:
> > > >> >> > > > ------------------------
> > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> transition
> > > node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > transitioned
> > > >> node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> Still
> > > >> waiting
> > > >> >> on
> > > >> >> > > the
> > > >> >> > > > master to process the split for
> > > bc62a8a72124a4ba3f6b9f302587903c
> > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> transition
> > > node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > transitioned
> > > >> node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > >> >> > > > -----------------------------
> > > >> >> > > >
> > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > > >> RS_ZK_REGION_SPLIT"
> > > >> >> > > > continues for 15 or so hours and finally settles without
> > manual
> > > >> >> > > > intervention with these regionserver log messages:
> > > >> >> > > > -----------------------
> > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> transition
> > > >> node *
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> to
> > > >> >> > > > RS_ZK_REGION_SPLIT
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > transitioned
> > > >> node *
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> to
> > > >> >> > > > RS_ZK_REGION_SPLIT
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > split,
> > > >> META
> > > >> >> > > > updated, and report to master.
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > split,
> > > >> META
> > > >> >> > > > updated, and report to master.
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > >> >> > > > new regions:
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > > >> >> > > >
> > > >> >> >
> > > >>
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
> 14mins,
> > > 2sec
> > > >> >> > > > ----------------------
> > > >> >> > > >
> > > >> >> > > > And the master finally logs this:
> > > >> >> > > > -----------------------
> > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> > > processing
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> > > SPLIT
> > > >> for
> > > >> >> > > region
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> doesn't
> > > exist
> > > >> >> > > anymore,
> > > >> >> > > > probably already processed its split
> > > >> >> > > > ------------------------
> > > >> >> > > >
> > > >> >> > > > I can't find any evidence that the region left the node
> > during
> > > >> that
> > > >> >> > time.
> > > >> >> > > > I'll have to catch it in action next time and see what the
> > > region
> > > >> is
> > > >> >> up
> > > >> >> > > to
> > > >> >> > > > during the problem period.
> > > >> >> > > >
> > > >> >> > > > What does it mean that it successfully transitioned from
> > SPLIT
> > > to
> > > >> >> > SPLIT?
> > > >> >> > > > Is that a valid transition?
> > > >> >> > > >
> > > >> >> > > > Matt
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> yuzhihong@gmail.com>
> > > >> wrote:
> > > >> >> > > >
> > > >> >> > > > > Matt:
> > > >> >> > > > > From the following we can see that region
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > >> >> > > > > cannot be found:
> > > >> >> > > > >
> > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Handling
> > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > > >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > >> >> > > > > processing
> > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Received
> > > SPLIT
> > > >> >> for
> > > >> >> > > > region
> > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > doesn't
> > > >> exist
> > > >> >> > > > anymore,
> > > >> >> > > > > probably already processed its split
> > > >> >> > > > >
> > > >> >> > > > > Have you run hbck to repair your cluster ?
> > > >> >> > > > >
> > > >> >> > > > > Thanks
> > > >> >> > > > >
> > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > > >> mcorgan@hotpads.com>
> > > >> >> > > wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Here's a sample of the master's logs from yesterday.
> >  It's
> > > not
> > > >> >> > > > correlated
> > > >> >> > > > > > exactly with the other pastebin log, but there's 3GB of
> > > this
> > > >> from
> > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > >> >> > > > > >
> > > >> >> > > > > > I'm am pushing the cluster a bit with importing data so
> > > >> testing
> > > >> >> the
> > > >> >> > > > split
> > > >> >> > > > > > code harder than normal.  The regions are 500-1GB
> > gzipped.
> > >  I
> > > >> can
> > > >> >> > > look
> > > >> >> > > > > into
> > > >> >> > > > > > it more but trying to figure out what to look for.
> > > >> >> > > > > >
> > > >> >> > > > > > Thanks Ted,
> > > >> >> > > > > > Matt
> > > >> >> > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > > yuzhihong@gmail.com>
> > > >> >> > wrote:
> > > >> >> > > > > >
> > > >> >> > > > > > > Matt:
> > > >> >> > > > > > > This is the method which made the logging:
> > > >> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher
> > > zkw,
> > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo
> b,
> > > >> >> > ServerName
> > > >> >> > > > > > > serverName,
> > > >> >> > > > > > >       final int znodeVersion)
> > > >> >> > > > > > >   throws KeeperException, IOException {
> > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> > > serverName,
> > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > > >> >> > > > > > >       znodeVersion, payload);
> > > >> >> > > > > > >   }
> > > >> >> > > > > > >
> > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when
> waiting
> > > for
> > > >> >> > master
> > > >> >> > > to
> > > >> >> > > > > > split
> > > >> >> > > > > > > the region. Obviously something caused the master not
> > > able
> > > >> to
> > > >> >> > > split.
> > > >> >> > > > > > >
> > > >> >> > > > > > > How large is the region ?
> > > >> >> > > > > > >
> > > >> >> > > > > > > Can you pastebin master log for that period of time ?
> > > >> >> > > > > > >
> > > >> >> > > > > > > Thanks
> > > >> >> > > > > > >
> > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> > > >> >> mcorgan@hotpads.com
> > > >> >> > >
> > > >> >> > > > > wrote:
> > > >> >> > > > > > >
> > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have
> > > >> started to
> > > >> >> > > > > encounter
> > > >> >> > > > > > > > infinite loops of region-transition on splits.  I'm
> > not
> > > >> sure
> > > >> >> > yet
> > > >> >> > > if
> > > >> >> > > > > > it's
> > > >> >> > > > > > > > all splits nor if it's related to load.  Solution
> so
> > > far
> > > >> has
> > > >> >> > been
> > > >> >> > > > to
> > > >> >> > > > > > > > restart the regionserver process.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > log snippet:
> > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > It's repeating these two lines:
> > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
> > > >> transition
> > > >> >> > > node
> > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > RS_ZK_REGION_SPLIT
> > > >> to
> > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> > > >> >> transitioned
> > > >> >> > > node
> > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > RS_ZK_REGION_SPLIT
> > > >> to
> > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > with the occasional:
> > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > >> >> > > > > > > >
> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > > >> Still
> > > >> >> > > > waiting
> > > >> >> > > > > on
> > > >> >> > > > > > > the
> > > >> >> > > > > > > > master to process the split for
> > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Should the region transition from
> RS_ZK_REGION_SPLIT
> > to
> > > >> >> itself?
> > > >> >> > >  It
> > > >> >> > > > > > looks
> > > >> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Thanks,
> > > >> >> > > > > > > > Matt
> > > >> >> > > > > > > >
> > > >> >> > > > > > >
> > > >> >> > > > > >
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

It happened in this new table that I have all the logs for.  The region in
question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace
of this region is on the regionserver with the "Intantiated tableName..."
message

2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Instantiated
StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
9784c21cbc2.

I also know this region came from a recent split, but I can't find any log
messages show the parent finishing the split that created this daughter
region.  So my guess now is that the split is actually finishing and
letting clients continue to write data, but something is failing to print
the log line and correctly tell the master about the new region.

I've noticed that these regions are showing up on the clients in calls to
HTable.getRegionLocations(), so the clients continue to function, but if I
call HBaseAdmin.move() i get an UnknownRegionException.


On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yu...@gmail.com> wrote:

> I think what Matt encountered was this:
> https://issues.apache.org/jira/browse/HBASE-7101
>
> On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > Beats me - i can't find any record of it before that.
> >
> > I'm importing data into another table now.  I disabled/enabled the table
> > first to make sure we have the original 4 region locations logged
> > everywhere.  Will report back...
> >
> >
> > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <jdcryans@apache.org
> > >wrote:
> >
> > > Wow, and can you figure how this happened?
> > >
> > > 2012-11-05 00:24:54,538 DEBUG
> > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > >
> > >
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > >
> > > Split? Else how did the master told that RS to open a region that
> doesn't
> > > exist?
> > >
> > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > > > Here's one from last night with the master logs at the bottom.
> > > > http://pastebin.com/cSdMbA2a
> > > >
> > > > I don't see that region in the master logs for 4 days back.  I'm
> going
> > to
> > > > import some new data from scratch soon and will be sure to keep all
> the
> > > > master logs.
> > > >
> > > >
> > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > jdcryans@apache.org
> > > >wrote:
> > > >
> > > >> This reminds me a lot of
> > > https://issues.apache.org/jira/browse/HBASE-4792
> > > >>
> > > >> What I'd like to see is the log from the first time the master
> > > >> receives the split message. I guess it says the region doesn't exist
> > > >> anymore because the split was processed already in the master but
> > > >> there's a failure mode similar to 4792.
> > > >>
> > > >> I saw this on another cluster last week but the logs were rolling
> > > >> based on size so the original split message was lost.
> > > >>
> > > >> J-D
> > > >>
> > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com>
> > > wrote:
> > > >> > I've had a few successful splits today without any errors.  I'll
> > turn
> > > up
> > > >> > the importer speed tomorrow to start stressing the cluster more.
> > > >> >
> > > >> >
> > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > >> >
> > > >> >> Matt:
> > > >> >> Was there any region server crash before you noticed the repeated
> > log
> > > >> for
> > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > > >> >>
> > > >> >> I wonder if the following JIRA might be related:
> > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
> > daughter
> > > >> >>
> > > >> >> Thanks
> > > >> >>
> > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mcorgan@hotpads.com
> >
> > > >> wrote:
> > > >> >>
> > > >> >> > Strangely, I don't see any record of that region in the master
> > > before
> > > >> >> what
> > > >> >> > I already pasted even though I have logs back to 10/30.  Next
> > time
> > > it
> > > >> >> > happens I'll gather a full log record and try to debug while
> it's
> > > >> >> > occurring.
> > > >> >> >
> > > >> >> >
> > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > > >> >> >
> > > >> >> > > Hi Matt,
> > > >> >> > > can you paste some more master logs of region
> > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > > >> >> > > I think Its not problem with splitting.
> > > >> >> > > We are getting
> > > >> >> > >       LOG.warn("Region " + encodedName + " not found on
> server
> > "
> > > +
> > > >> >> > > serverName +
> > > >> >> > >         "; failed processing");
> > > >> >> > > this log means no entry in servers map(not fully assigned).
> > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > > >> >> > >     HRegionInfo foundHri = null;
> > > >> >> > >     for (HRegionInfo hri: hris) {
> > > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> > > >> >> > >         foundHri = hri;
> > > >> >> > >         break;
> > > >> >> > >       }
> > > >> >> > >     }
> > > >> >> > >     return foundHri;
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > > lhofhansl@yahoo.com>
> > > >> >> > wrote:
> > > >> >> > >
> > > >> >> > > > CC'ing dev list...
> > > >> >> > > >
> > > >> >> > > > Is anybody aware of any changes that went in recently that
> > > could
> > > >> >> cause
> > > >> >> > > > this?
> > > >> >> > > > I looked around a bit, but could not find anything obvious.
> > > >> >> > > >
> > > >> >> > > > -- Lars
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > ________________________________
> > > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > > >> >> > > > To: user <us...@hbase.apache.org>
> > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > > >> >> > > >
> > > >> >> > > > I think the cluster is ok without running hbck, as
> restarting
> > > the
> > > >> >> > > > regionserver process stops the warnings and everything
> looks
> > ok
> > > >> >> > > otherwise.
> > > >> >> > > >
> > > >> >> > > > here's the regionserver right after the split happens:
> > > >> >> > > > ------------------------
> > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> transition
> > > node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > transitioned
> > > >> node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> Still
> > > >> waiting
> > > >> >> on
> > > >> >> > > the
> > > >> >> > > > master to process the split for
> > > bc62a8a72124a4ba3f6b9f302587903c
> > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> transition
> > > node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > transitioned
> > > >> node
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > >> >> > > > -----------------------------
> > > >> >> > > >
> > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > > >> RS_ZK_REGION_SPLIT"
> > > >> >> > > > continues for 15 or so hours and finally settles without
> > manual
> > > >> >> > > > intervention with these regionserver log messages:
> > > >> >> > > > -----------------------
> > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> transition
> > > >> node *
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> to
> > > >> >> > > > RS_ZK_REGION_SPLIT
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > transitioned
> > > >> node *
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
> to
> > > >> >> > > > RS_ZK_REGION_SPLIT
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > split,
> > > >> META
> > > >> >> > > > updated, and report to master.
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > > split,
> > > >> META
> > > >> >> > > > updated, and report to master.
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > >> >> > > > new regions:
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > > >> >> > > >
> > > >> >> >
> > > >>
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
> 14mins,
> > > 2sec
> > > >> >> > > > ----------------------
> > > >> >> > > >
> > > >> >> > > > And the master finally logs this:
> > > >> >> > > > -----------------------
> > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> > > processing
> > > >> >> > > >
> > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> > > SPLIT
> > > >> for
> > > >> >> > > region
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> doesn't
> > > exist
> > > >> >> > > anymore,
> > > >> >> > > > probably already processed its split
> > > >> >> > > > ------------------------
> > > >> >> > > >
> > > >> >> > > > I can't find any evidence that the region left the node
> > during
> > > >> that
> > > >> >> > time.
> > > >> >> > > > I'll have to catch it in action next time and see what the
> > > region
> > > >> is
> > > >> >> up
> > > >> >> > > to
> > > >> >> > > > during the problem period.
> > > >> >> > > >
> > > >> >> > > > What does it mean that it successfully transitioned from
> > SPLIT
> > > to
> > > >> >> > SPLIT?
> > > >> >> > > > Is that a valid transition?
> > > >> >> > > >
> > > >> >> > > > Matt
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> yuzhihong@gmail.com>
> > > >> wrote:
> > > >> >> > > >
> > > >> >> > > > > Matt:
> > > >> >> > > > > From the following we can see that region
> > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > >> >> > > > > cannot be found:
> > > >> >> > > > >
> > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Handling
> > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > > >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > >> >> > > > > processing
> > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
> Received
> > > SPLIT
> > > >> >> for
> > > >> >> > > > region
> > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> > doesn't
> > > >> exist
> > > >> >> > > > anymore,
> > > >> >> > > > > probably already processed its split
> > > >> >> > > > >
> > > >> >> > > > > Have you run hbck to repair your cluster ?
> > > >> >> > > > >
> > > >> >> > > > > Thanks
> > > >> >> > > > >
> > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > > >> mcorgan@hotpads.com>
> > > >> >> > > wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Here's a sample of the master's logs from yesterday.
> >  It's
> > > not
> > > >> >> > > > correlated
> > > >> >> > > > > > exactly with the other pastebin log, but there's 3GB of
> > > this
> > > >> from
> > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > >> >> > > > > >
> > > >> >> > > > > > I'm am pushing the cluster a bit with importing data so
> > > >> testing
> > > >> >> the
> > > >> >> > > > split
> > > >> >> > > > > > code harder than normal.  The regions are 500-1GB
> > gzipped.
> > >  I
> > > >> can
> > > >> >> > > look
> > > >> >> > > > > into
> > > >> >> > > > > > it more but trying to figure out what to look for.
> > > >> >> > > > > >
> > > >> >> > > > > > Thanks Ted,
> > > >> >> > > > > > Matt
> > > >> >> > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > > yuzhihong@gmail.com>
> > > >> >> > wrote:
> > > >> >> > > > > >
> > > >> >> > > > > > > Matt:
> > > >> >> > > > > > > This is the method which made the logging:
> > > >> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher
> > > zkw,
> > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo
> b,
> > > >> >> > ServerName
> > > >> >> > > > > > > serverName,
> > > >> >> > > > > > >       final int znodeVersion)
> > > >> >> > > > > > >   throws KeeperException, IOException {
> > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> > > serverName,
> > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > > >> >> > > > > > >       znodeVersion, payload);
> > > >> >> > > > > > >   }
> > > >> >> > > > > > >
> > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when
> waiting
> > > for
> > > >> >> > master
> > > >> >> > > to
> > > >> >> > > > > > split
> > > >> >> > > > > > > the region. Obviously something caused the master not
> > > able
> > > >> to
> > > >> >> > > split.
> > > >> >> > > > > > >
> > > >> >> > > > > > > How large is the region ?
> > > >> >> > > > > > >
> > > >> >> > > > > > > Can you pastebin master log for that period of time ?
> > > >> >> > > > > > >
> > > >> >> > > > > > > Thanks
> > > >> >> > > > > > >
> > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> > > >> >> mcorgan@hotpads.com
> > > >> >> > >
> > > >> >> > > > > wrote:
> > > >> >> > > > > > >
> > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have
> > > >> started to
> > > >> >> > > > > encounter
> > > >> >> > > > > > > > infinite loops of region-transition on splits.  I'm
> > not
> > > >> sure
> > > >> >> > yet
> > > >> >> > > if
> > > >> >> > > > > > it's
> > > >> >> > > > > > > > all splits nor if it's related to load.  Solution
> so
> > > far
> > > >> has
> > > >> >> > been
> > > >> >> > > > to
> > > >> >> > > > > > > > restart the regionserver process.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > log snippet:
> > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > It's repeating these two lines:
> > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
> > > >> transition
> > > >> >> > > node
> > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > RS_ZK_REGION_SPLIT
> > > >> to
> > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> > > >> >> transitioned
> > > >> >> > > node
> > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > > RS_ZK_REGION_SPLIT
> > > >> to
> > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > with the occasional:
> > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > >> >> > > > > > > >
> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > > >> Still
> > > >> >> > > > waiting
> > > >> >> > > > > on
> > > >> >> > > > > > > the
> > > >> >> > > > > > > > master to process the split for
> > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Should the region transition from
> RS_ZK_REGION_SPLIT
> > to
> > > >> >> itself?
> > > >> >> > >  It
> > > >> >> > > > > > looks
> > > >> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Thanks,
> > > >> >> > > > > > > > Matt
> > > >> >> > > > > > > >
> > > >> >> > > > > > >
> > > >> >> > > > > >
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Ted Yu <yu...@gmail.com>.

I think what Matt encountered was this:
https://issues.apache.org/jira/browse/HBASE-7101

On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mc...@hotpads.com> wrote:

> Beats me - i can't find any record of it before that.
>
> I'm importing data into another table now.  I disabled/enabled the table
> first to make sure we have the original 4 region locations logged
> everywhere.  Will report back...
>
>
> On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
>
> > Wow, and can you figure how this happened?
> >
> > 2012-11-05 00:24:54,538 DEBUG
> > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> >
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> >
> > Split? Else how did the master told that RS to open a region that doesn't
> > exist?
> >
> > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com> wrote:
> > > Here's one from last night with the master logs at the bottom.
> > > http://pastebin.com/cSdMbA2a
> > >
> > > I don't see that region in the master logs for 4 days back.  I'm going
> to
> > > import some new data from scratch soon and will be sure to keep all the
> > > master logs.
> > >
> > >
> > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> jdcryans@apache.org
> > >wrote:
> > >
> > >> This reminds me a lot of
> > https://issues.apache.org/jira/browse/HBASE-4792
> > >>
> > >> What I'd like to see is the log from the first time the master
> > >> receives the split message. I guess it says the region doesn't exist
> > >> anymore because the split was processed already in the master but
> > >> there's a failure mode similar to 4792.
> > >>
> > >> I saw this on another cluster last week but the logs were rolling
> > >> based on size so the original split message was lost.
> > >>
> > >> J-D
> > >>
> > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > >> > I've had a few successful splits today without any errors.  I'll
> turn
> > up
> > >> > the importer speed tomorrow to start stressing the cluster more.
> > >> >
> > >> >
> > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:
> > >> >
> > >> >> Matt:
> > >> >> Was there any region server crash before you noticed the repeated
> log
> > >> for
> > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > >> >>
> > >> >> I wonder if the following JIRA might be related:
> > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing
> daughter
> > >> >>
> > >> >> Thanks
> > >> >>
> > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com>
> > >> wrote:
> > >> >>
> > >> >> > Strangely, I don't see any record of that region in the master
> > before
> > >> >> what
> > >> >> > I already pasted even though I have logs back to 10/30.  Next
> time
> > it
> > >> >> > happens I'll gather a full log record and try to debug while it's
> > >> >> > occurring.
> > >> >> >
> > >> >> >
> > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > >> >> > chrajeshbabu32@gmail.com> wrote:
> > >> >> >
> > >> >> > > Hi Matt,
> > >> >> > > can you paste some more master logs of region
> > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > >> >> > > I think Its not problem with splitting.
> > >> >> > > We are getting
> > >> >> > >       LOG.warn("Region " + encodedName + " not found on server
> "
> > +
> > >> >> > > serverName +
> > >> >> > >         "; failed processing");
> > >> >> > > this log means no entry in servers map(not fully assigned).
> > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > >> >> > >     HRegionInfo foundHri = null;
> > >> >> > >     for (HRegionInfo hri: hris) {
> > >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> > >> >> > >         foundHri = hri;
> > >> >> > >         break;
> > >> >> > >       }
> > >> >> > >     }
> > >> >> > >     return foundHri;
> > >> >> > >
> > >> >> > >
> > >> >> > >
> > >> >> > >
> > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > lhofhansl@yahoo.com>
> > >> >> > wrote:
> > >> >> > >
> > >> >> > > > CC'ing dev list...
> > >> >> > > >
> > >> >> > > > Is anybody aware of any changes that went in recently that
> > could
> > >> >> cause
> > >> >> > > > this?
> > >> >> > > > I looked around a bit, but could not find anything obvious.
> > >> >> > > >
> > >> >> > > > -- Lars
> > >> >> > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > ________________________________
> > >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> > >> >> > > > To: user <us...@hbase.apache.org>
> > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >> >> > > >
> > >> >> > > > I think the cluster is ok without running hbck, as restarting
> > the
> > >> >> > > > regionserver process stops the warnings and everything looks
> ok
> > >> >> > > otherwise.
> > >> >> > > >
> > >> >> > > > here's the regionserver right after the split happens:
> > >> >> > > > ------------------------
> > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> > node
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> transitioned
> > >> node
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> > >> waiting
> > >> >> on
> > >> >> > > the
> > >> >> > > > master to process the split for
> > bc62a8a72124a4ba3f6b9f302587903c
> > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> > node
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> transitioned
> > >> node
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > >> >> > > > -----------------------------
> > >> >> > > >
> > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > >> RS_ZK_REGION_SPLIT"
> > >> >> > > > continues for 15 or so hours and finally settles without
> manual
> > >> >> > > > intervention with these regionserver log messages:
> > >> >> > > > -----------------------
> > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> > >> node *
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > >> >> > > > RS_ZK_REGION_SPLIT
> > >> >> > > >
> > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> transitioned
> > >> node *
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > >> >> > > > RS_ZK_REGION_SPLIT
> > >> >> > > >
> > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > split,
> > >> META
> > >> >> > > > updated, and report to master.
> > >> >> > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > >> >> > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > >> >> > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > >> >> > > > Split took 15hrs, 9mins, 33sec
> > >> >> > > >
> > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> > split,
> > >> META
> > >> >> > > > updated, and report to master.
> > >> >> > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > >> >> > > > new regions:
> > >> >> > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > >> >> > > >
> > >> >> >
> > >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins,
> > 2sec
> > >> >> > > > ----------------------
> > >> >> > > >
> > >> >> > > > And the master finally logs this:
> > >> >> > > > -----------------------
> > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> > processing
> > >> >> > > >
> > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> > SPLIT
> > >> for
> > >> >> > > region
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't
> > exist
> > >> >> > > anymore,
> > >> >> > > > probably already processed its split
> > >> >> > > > ------------------------
> > >> >> > > >
> > >> >> > > > I can't find any evidence that the region left the node
> during
> > >> that
> > >> >> > time.
> > >> >> > > > I'll have to catch it in action next time and see what the
> > region
> > >> is
> > >> >> up
> > >> >> > > to
> > >> >> > > > during the problem period.
> > >> >> > > >
> > >> >> > > > What does it mean that it successfully transitioned from
> SPLIT
> > to
> > >> >> > SPLIT?
> > >> >> > > > Is that a valid transition?
> > >> >> > > >
> > >> >> > > > Matt
> > >> >> > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com>
> > >> wrote:
> > >> >> > > >
> > >> >> > > > > Matt:
> > >> >> > > > > From the following we can see that region
> > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > >> >> > > > > cannot be found:
> > >> >> > > > >
> > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > >> >> > > > > processing
> > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> > SPLIT
> > >> >> for
> > >> >> > > > region
> > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
> doesn't
> > >> exist
> > >> >> > > > anymore,
> > >> >> > > > > probably already processed its split
> > >> >> > > > >
> > >> >> > > > > Have you run hbck to repair your cluster ?
> > >> >> > > > >
> > >> >> > > > > Thanks
> > >> >> > > > >
> > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > >> mcorgan@hotpads.com>
> > >> >> > > wrote:
> > >> >> > > > >
> > >> >> > > > > > Here's a sample of the master's logs from yesterday.
>  It's
> > not
> > >> >> > > > correlated
> > >> >> > > > > > exactly with the other pastebin log, but there's 3GB of
> > this
> > >> from
> > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > >> >> > > > > >
> > >> >> > > > > > I'm am pushing the cluster a bit with importing data so
> > >> testing
> > >> >> the
> > >> >> > > > split
> > >> >> > > > > > code harder than normal.  The regions are 500-1GB
> gzipped.
> >  I
> > >> can
> > >> >> > > look
> > >> >> > > > > into
> > >> >> > > > > > it more but trying to figure out what to look for.
> > >> >> > > > > >
> > >> >> > > > > > Thanks Ted,
> > >> >> > > > > > Matt
> > >> >> > > > > >
> > >> >> > > > > >
> > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > >> >> > wrote:
> > >> >> > > > > >
> > >> >> > > > > > > Matt:
> > >> >> > > > > > > This is the method which made the logging:
> > >> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher
> > zkw,
> > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
> > >> >> > ServerName
> > >> >> > > > > > > serverName,
> > >> >> > > > > > >       final int znodeVersion)
> > >> >> > > > > > >   throws KeeperException, IOException {
> > >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> > serverName,
> > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > >> >> EventType.RS_ZK_REGION_SPLIT,
> > >> >> > > > > > >       znodeVersion, payload);
> > >> >> > > > > > >   }
> > >> >> > > > > > >
> > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting
> > for
> > >> >> > master
> > >> >> > > to
> > >> >> > > > > > split
> > >> >> > > > > > > the region. Obviously something caused the master not
> > able
> > >> to
> > >> >> > > split.
> > >> >> > > > > > >
> > >> >> > > > > > > How large is the region ?
> > >> >> > > > > > >
> > >> >> > > > > > > Can you pastebin master log for that period of time ?
> > >> >> > > > > > >
> > >> >> > > > > > > Thanks
> > >> >> > > > > > >
> > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> > >> >> mcorgan@hotpads.com
> > >> >> > >
> > >> >> > > > > wrote:
> > >> >> > > > > > >
> > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have
> > >> started to
> > >> >> > > > > encounter
> > >> >> > > > > > > > infinite loops of region-transition on splits.  I'm
> not
> > >> sure
> > >> >> > yet
> > >> >> > > if
> > >> >> > > > > > it's
> > >> >> > > > > > > > all splits nor if it's related to load.  Solution so
> > far
> > >> has
> > >> >> > been
> > >> >> > > > to
> > >> >> > > > > > > > restart the regionserver process.
> > >> >> > > > > > > >
> > >> >> > > > > > > > log snippet:
> > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > >> >> > > > > > > >
> > >> >> > > > > > > > It's repeating these two lines:
> > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
> > >> transition
> > >> >> > > node
> > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > RS_ZK_REGION_SPLIT
> > >> to
> > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> > >> >> transitioned
> > >> >> > > node
> > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > RS_ZK_REGION_SPLIT
> > >> to
> > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > >> >> > > > > > > >
> > >> >> > > > > > > > with the occasional:
> > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > >> >> > > > > > > >
> org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > >> Still
> > >> >> > > > waiting
> > >> >> > > > > on
> > >> >> > > > > > > the
> > >> >> > > > > > > > master to process the split for
> > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > >> >> > > > > > > >
> > >> >> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT
> to
> > >> >> itself?
> > >> >> > >  It
> > >> >> > > > > > looks
> > >> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> > >> >> > > > > > > >
> > >> >> > > > > > > > Thanks,
> > >> >> > > > > > > > Matt
> > >> >> > > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

Beats me - i can't find any record of it before that.

I'm importing data into another table now.  I disabled/enabled the table
first to make sure we have the original 4 region locations logged
everywhere.  Will report back...


On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Wow, and can you figure how this happened?
>
> 2012-11-05 00:24:54,538 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
>
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
>
> Split? Else how did the master told that RS to open a region that doesn't
> exist?
>
> On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com> wrote:
> > Here's one from last night with the master logs at the bottom.
> > http://pastebin.com/cSdMbA2a
> >
> > I don't see that region in the master logs for 4 days back.  I'm going to
> > import some new data from scratch soon and will be sure to keep all the
> > master logs.
> >
> >
> > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> This reminds me a lot of
> https://issues.apache.org/jira/browse/HBASE-4792
> >>
> >> What I'd like to see is the log from the first time the master
> >> receives the split message. I guess it says the region doesn't exist
> >> anymore because the split was processed already in the master but
> >> there's a failure mode similar to 4792.
> >>
> >> I saw this on another cluster last week but the logs were rolling
> >> based on size so the original split message was lost.
> >>
> >> J-D
> >>
> >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> >> > I've had a few successful splits today without any errors.  I'll turn
> up
> >> > the importer speed tomorrow to start stressing the cluster more.
> >> >
> >> >
> >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >> >
> >> >> Matt:
> >> >> Was there any region server crash before you noticed the repeated log
> >> for
> >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> >> >>
> >> >> I wonder if the following JIRA might be related:
> >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
> >> >>
> >> >> Thanks
> >> >>
> >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com>
> >> wrote:
> >> >>
> >> >> > Strangely, I don't see any record of that region in the master
> before
> >> >> what
> >> >> > I already pasted even though I have logs back to 10/30.  Next time
> it
> >> >> > happens I'll gather a full log record and try to debug while it's
> >> >> > occurring.
> >> >> >
> >> >> >
> >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> >> >> > chrajeshbabu32@gmail.com> wrote:
> >> >> >
> >> >> > > Hi Matt,
> >> >> > > can you paste some more master logs of region
> >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> >> >> > > I think Its not problem with splitting.
> >> >> > > We are getting
> >> >> > >       LOG.warn("Region " + encodedName + " not found on server "
> +
> >> >> > > serverName +
> >> >> > >         "; failed processing");
> >> >> > > this log means no entry in servers map(not fully assigned).
> >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> >> >> > >     HRegionInfo foundHri = null;
> >> >> > >     for (HRegionInfo hri: hris) {
> >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> >> >> > >         foundHri = hri;
> >> >> > >         break;
> >> >> > >       }
> >> >> > >     }
> >> >> > >     return foundHri;
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> lhofhansl@yahoo.com>
> >> >> > wrote:
> >> >> > >
> >> >> > > > CC'ing dev list...
> >> >> > > >
> >> >> > > > Is anybody aware of any changes that went in recently that
> could
> >> >> cause
> >> >> > > > this?
> >> >> > > > I looked around a bit, but could not find anything obvious.
> >> >> > > >
> >> >> > > > -- Lars
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > ________________________________
> >> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> >> >> > > > To: user <us...@hbase.apache.org>
> >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> >> > > >
> >> >> > > > I think the cluster is ok without running hbck, as restarting
> the
> >> >> > > > regionserver process stops the warnings and everything looks ok
> >> >> > > otherwise.
> >> >> > > >
> >> >> > > > here's the regionserver right after the split happens:
> >> >> > > > ------------------------
> >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> >> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> >> waiting
> >> >> on
> >> >> > > the
> >> >> > > > master to process the split for
> bc62a8a72124a4ba3f6b9f302587903c
> >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> >> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> >> >> > > > -----------------------------
> >> >> > > >
> >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> >> RS_ZK_REGION_SPLIT"
> >> >> > > > continues for 15 or so hours and finally settles without manual
> >> >> > > > intervention with these regionserver log messages:
> >> >> > > > -----------------------
> >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> >> node *
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> >> >> > > > RS_ZK_REGION_SPLIT
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> >> node *
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> >> >> > > > RS_ZK_REGION_SPLIT
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,916 INFO
> >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> split,
> >> META
> >> >> > > > updated, and report to master.
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> >> >> > > > Split took 15hrs, 9mins, 33sec
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,945 INFO
> >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> split,
> >> META
> >> >> > > > updated, and report to master.
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> >> >> > > > new regions:
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> >> >> > > >
> >> >> >
> >> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins,
> 2sec
> >> >> > > > ----------------------
> >> >> > > >
> >> >> > > > And the master finally logs this:
> >> >> > > > -----------------------
> >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> processing
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> SPLIT
> >> for
> >> >> > > region
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't
> exist
> >> >> > > anymore,
> >> >> > > > probably already processed its split
> >> >> > > > ------------------------
> >> >> > > >
> >> >> > > > I can't find any evidence that the region left the node during
> >> that
> >> >> > time.
> >> >> > > > I'll have to catch it in action next time and see what the
> region
> >> is
> >> >> up
> >> >> > > to
> >> >> > > > during the problem period.
> >> >> > > >
> >> >> > > > What does it mean that it successfully transitioned from SPLIT
> to
> >> >> > SPLIT?
> >> >> > > > Is that a valid transition?
> >> >> > > >
> >> >> > > > Matt
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com>
> >> wrote:
> >> >> > > >
> >> >> > > > > Matt:
> >> >> > > > > From the following we can see that region
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> >> >> > > > > cannot be found:
> >> >> > > > >
> >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> >> >> > > > > processing
> >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> SPLIT
> >> >> for
> >> >> > > > region
> >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't
> >> exist
> >> >> > > > anymore,
> >> >> > > > > probably already processed its split
> >> >> > > > >
> >> >> > > > > Have you run hbck to repair your cluster ?
> >> >> > > > >
> >> >> > > > > Thanks
> >> >> > > > >
> >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> >> mcorgan@hotpads.com>
> >> >> > > wrote:
> >> >> > > > >
> >> >> > > > > > Here's a sample of the master's logs from yesterday.  It's
> not
> >> >> > > > correlated
> >> >> > > > > > exactly with the other pastebin log, but there's 3GB of
> this
> >> from
> >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> >> >> > > > > >
> >> >> > > > > > I'm am pushing the cluster a bit with importing data so
> >> testing
> >> >> the
> >> >> > > > split
> >> >> > > > > > code harder than normal.  The regions are 500-1GB gzipped.
>  I
> >> can
> >> >> > > look
> >> >> > > > > into
> >> >> > > > > > it more but trying to figure out what to look for.
> >> >> > > > > >
> >> >> > > > > > Thanks Ted,
> >> >> > > > > > Matt
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> yuzhihong@gmail.com>
> >> >> > wrote:
> >> >> > > > > >
> >> >> > > > > > > Matt:
> >> >> > > > > > > This is the method which made the logging:
> >> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher
> zkw,
> >> >> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
> >> >> > ServerName
> >> >> > > > > > > serverName,
> >> >> > > > > > >       final int znodeVersion)
> >> >> > > > > > >   throws KeeperException, IOException {
> >> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> >> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent,
> serverName,
> >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> >> >> EventType.RS_ZK_REGION_SPLIT,
> >> >> > > > > > >       znodeVersion, payload);
> >> >> > > > > > >   }
> >> >> > > > > > >
> >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting
> for
> >> >> > master
> >> >> > > to
> >> >> > > > > > split
> >> >> > > > > > > the region. Obviously something caused the master not
> able
> >> to
> >> >> > > split.
> >> >> > > > > > >
> >> >> > > > > > > How large is the region ?
> >> >> > > > > > >
> >> >> > > > > > > Can you pastebin master log for that period of time ?
> >> >> > > > > > >
> >> >> > > > > > > Thanks
> >> >> > > > > > >
> >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> >> >> mcorgan@hotpads.com
> >> >> > >
> >> >> > > > > wrote:
> >> >> > > > > > >
> >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have
> >> started to
> >> >> > > > > encounter
> >> >> > > > > > > > infinite loops of region-transition on splits.  I'm not
> >> sure
> >> >> > yet
> >> >> > > if
> >> >> > > > > > it's
> >> >> > > > > > > > all splits nor if it's related to load.  Solution so
> far
> >> has
> >> >> > been
> >> >> > > > to
> >> >> > > > > > > > restart the regionserver process.
> >> >> > > > > > > >
> >> >> > > > > > > > log snippet:
> >> >> > > > > > > > http://pastebin.com/LpienZ7B
> >> >> > > > > > > >
> >> >> > > > > > > > It's repeating these two lines:
> >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
> >> transition
> >> >> > > node
> >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> RS_ZK_REGION_SPLIT
> >> to
> >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> >> >> transitioned
> >> >> > > node
> >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> RS_ZK_REGION_SPLIT
> >> to
> >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> >> > > > > > > >
> >> >> > > > > > > > with the occasional:
> >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> >> >> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> Still
> >> >> > > > waiting
> >> >> > > > > on
> >> >> > > > > > > the
> >> >> > > > > > > > master to process the split for
> >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> >> >> > > > > > > >
> >> >> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to
> >> >> itself?
> >> >> > >  It
> >> >> > > > > > looks
> >> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> >> >> > > > > > > >
> >> >> > > > > > > > Thanks,
> >> >> > > > > > > > Matt
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Jean-Daniel Cryans <jd...@apache.org>.

Wow, and can you figure how this happened?

2012-11-05 00:24:54,538 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.

Split? Else how did the master told that RS to open a region that doesn't exist?

On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mc...@hotpads.com> wrote:
> Here's one from last night with the master logs at the bottom.
> http://pastebin.com/cSdMbA2a
>
> I don't see that region in the master logs for 4 days back.  I'm going to
> import some new data from scratch soon and will be sure to keep all the
> master logs.
>
>
> On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792
>>
>> What I'd like to see is the log from the first time the master
>> receives the split message. I guess it says the region doesn't exist
>> anymore because the split was processed already in the master but
>> there's a failure mode similar to 4792.
>>
>> I saw this on another cluster last week but the logs were rolling
>> based on size so the original split message was lost.
>>
>> J-D
>>
>> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com> wrote:
>> > I've had a few successful splits today without any errors.  I'll turn up
>> > the importer speed tomorrow to start stressing the cluster more.
>> >
>> >
>> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:
>> >
>> >> Matt:
>> >> Was there any region server crash before you noticed the repeated log
>> for
>> >> bc62a8a72124a4ba3f6b9f302587903c ?
>> >>
>> >> I wonder if the following JIRA might be related:
>> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
>> >>
>> >> Thanks
>> >>
>> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com>
>> wrote:
>> >>
>> >> > Strangely, I don't see any record of that region in the master before
>> >> what
>> >> > I already pasted even though I have logs back to 10/30.  Next time it
>> >> > happens I'll gather a full log record and try to debug while it's
>> >> > occurring.
>> >> >
>> >> >
>> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
>> >> > chrajeshbabu32@gmail.com> wrote:
>> >> >
>> >> > > Hi Matt,
>> >> > > can you paste some more master logs of region
>> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
>> >> > > I think Its not problem with splitting.
>> >> > > We are getting
>> >> > >       LOG.warn("Region " + encodedName + " not found on server " +
>> >> > > serverName +
>> >> > >         "; failed processing");
>> >> > > this log means no entry in servers map(not fully assigned).
>> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
>> >> > >     HRegionInfo foundHri = null;
>> >> > >     for (HRegionInfo hri: hris) {
>> >> > >       if (hri.getEncodedName().equals(encodedName)) {
>> >> > >         foundHri = hri;
>> >> > >         break;
>> >> > >       }
>> >> > >     }
>> >> > >     return foundHri;
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
>> >> > wrote:
>> >> > >
>> >> > > > CC'ing dev list...
>> >> > > >
>> >> > > > Is anybody aware of any changes that went in recently that could
>> >> cause
>> >> > > > this?
>> >> > > > I looked around a bit, but could not find anything obvious.
>> >> > > >
>> >> > > > -- Lars
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > ________________________________
>> >> > > >  From: Matt Corgan <mc...@hotpads.com>
>> >> > > > To: user <us...@hbase.apache.org>
>> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
>> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> >> > > >
>> >> > > > I think the cluster is ok without running hbck, as restarting the
>> >> > > > regionserver process stops the warnings and everything looks ok
>> >> > > otherwise.
>> >> > > >
>> >> > > > here's the regionserver right after the split happens:
>> >> > > > ------------------------
>> >> > > > 2012-11-01 22:45:28,726 DEBUG
>> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
>> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
>> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
>> node
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
>> >> > > > 2012-11-01 22:45:28,730 DEBUG
>> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
>> waiting
>> >> on
>> >> > > the
>> >> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
>> >> > > > 2012-11-01 22:45:28,832 DEBUG
>> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
>> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
>> >> > > > 2012-11-01 22:45:28,837 DEBUG
>> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
>> node
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
>> >> > > > -----------------------------
>> >> > > >
>> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
>> RS_ZK_REGION_SPLIT"
>> >> > > > continues for 15 or so hours and finally settles without manual
>> >> > > > intervention with these regionserver log messages:
>> >> > > > -----------------------
>> >> > > > 2012-11-02 13:55:00,906 DEBUG
>> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
>> node *
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
>> >> > > > RS_ZK_REGION_SPLIT
>> >> > > >
>> >> > > > 2012-11-02 13:55:00,916 DEBUG
>> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
>> node *
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
>> >> > > > RS_ZK_REGION_SPLIT
>> >> > > >
>> >> > > > 2012-11-02 13:55:00,916 INFO
>> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split,
>> META
>> >> > > > updated, and report to master.
>> >> > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
>> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>> >> > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>> >> > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
>> >> > > > Split took 15hrs, 9mins, 33sec
>> >> > > >
>> >> > > > 2012-11-02 13:55:00,945 INFO
>> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split,
>> META
>> >> > > > updated, and report to master.
>> >> > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
>> >> > > > new regions:
>> >> > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
>> >> > > >
>> >> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
>> >> > > > ----------------------
>> >> > > >
>> >> > > > And the master finally logs this:
>> >> > > > -----------------------
>> >> > > > 2012-11-02 13:55:00,783 WARN
>> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
>> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
>> >> > > >
>> >> > > > 2012-11-02 13:55:00,783 WARN
>> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
>> for
>> >> > > region
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
>> >> > > anymore,
>> >> > > > probably already processed its split
>> >> > > > ------------------------
>> >> > > >
>> >> > > > I can't find any evidence that the region left the node during
>> that
>> >> > time.
>> >> > > > I'll have to catch it in action next time and see what the region
>> is
>> >> up
>> >> > > to
>> >> > > > during the problem period.
>> >> > > >
>> >> > > > What does it mean that it successfully transitioned from SPLIT to
>> >> > SPLIT?
>> >> > > > Is that a valid transition?
>> >> > > >
>> >> > > > Matt
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com>
>> wrote:
>> >> > > >
>> >> > > > > Matt:
>> >> > > > > From the following we can see that region
>> >> > > > bc62a8a72124a4ba3f6b9f302587903c
>> >> > > > > cannot be found:
>> >> > > > >
>> >> > > > > 2012-11-02 00:00:02,909 DEBUG
>> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
>> >> > > > > transition=RS_ZK_REGION_SPLIT,
>> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
>> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
>> >> > > > > 2012-11-02 00:00:02,909 WARN
>> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
>> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
>> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
>> >> > > > > processing
>> >> > > > > 2012-11-02 00:00:02,909 WARN
>> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
>> >> for
>> >> > > > region
>> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't
>> exist
>> >> > > > anymore,
>> >> > > > > probably already processed its split
>> >> > > > >
>> >> > > > > Have you run hbck to repair your cluster ?
>> >> > > > >
>> >> > > > > Thanks
>> >> > > > >
>> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
>> mcorgan@hotpads.com>
>> >> > > wrote:
>> >> > > > >
>> >> > > > > > Here's a sample of the master's logs from yesterday.  It's not
>> >> > > > correlated
>> >> > > > > > exactly with the other pastebin log, but there's 3GB of this
>> from
>> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
>> >> > > > > >
>> >> > > > > > I'm am pushing the cluster a bit with importing data so
>> testing
>> >> the
>> >> > > > split
>> >> > > > > > code harder than normal.  The regions are 500-1GB gzipped.  I
>> can
>> >> > > look
>> >> > > > > into
>> >> > > > > > it more but trying to figure out what to look for.
>> >> > > > > >
>> >> > > > > > Thanks Ted,
>> >> > > > > > Matt
>> >> > > > > >
>> >> > > > > >
>> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
>> >> > wrote:
>> >> > > > > >
>> >> > > > > > > Matt:
>> >> > > > > > > This is the method which made the logging:
>> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
>> >> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
>> >> > ServerName
>> >> > > > > > > serverName,
>> >> > > > > > >       final int znodeVersion)
>> >> > > > > > >   throws KeeperException, IOException {
>> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
>> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
>> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
>> >> EventType.RS_ZK_REGION_SPLIT,
>> >> > > > > > >       znodeVersion, payload);
>> >> > > > > > >   }
>> >> > > > > > >
>> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
>> >> > master
>> >> > > to
>> >> > > > > > split
>> >> > > > > > > the region. Obviously something caused the master not able
>> to
>> >> > > split.
>> >> > > > > > >
>> >> > > > > > > How large is the region ?
>> >> > > > > > >
>> >> > > > > > > Can you pastebin master log for that period of time ?
>> >> > > > > > >
>> >> > > > > > > Thanks
>> >> > > > > > >
>> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
>> >> mcorgan@hotpads.com
>> >> > >
>> >> > > > > wrote:
>> >> > > > > > >
>> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have
>> started to
>> >> > > > > encounter
>> >> > > > > > > > infinite loops of region-transition on splits.  I'm not
>> sure
>> >> > yet
>> >> > > if
>> >> > > > > > it's
>> >> > > > > > > > all splits nor if it's related to load.  Solution so far
>> has
>> >> > been
>> >> > > > to
>> >> > > > > > > > restart the regionserver process.
>> >> > > > > > > >
>> >> > > > > > > > log snippet:
>> >> > > > > > > > http://pastebin.com/LpienZ7B
>> >> > > > > > > >
>> >> > > > > > > > It's repeating these two lines:
>> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
>> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
>> transition
>> >> > > node
>> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT
>> to
>> >> > > > > > > > RS_ZK_REGION_SPLIT
>> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
>> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
>> >> transitioned
>> >> > > node
>> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT
>> to
>> >> > > > > > > > RS_ZK_REGION_SPLIT
>> >> > > > > > > >
>> >> > > > > > > > with the occasional:
>> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
>> >> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
>> Still
>> >> > > > waiting
>> >> > > > > on
>> >> > > > > > > the
>> >> > > > > > > > master to process the split for
>> >> > cf3e9bc069e1888983c06dc8e053ffcf
>> >> > > > > > > >
>> >> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to
>> >> itself?
>> >> > >  It
>> >> > > > > > looks
>> >> > > > > > > > wrong, but I'm not familiar with the code at all.
>> >> > > > > > > >
>> >> > > > > > > > Thanks,
>> >> > > > > > > > Matt
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

Here's one from last night with the master logs at the bottom.
http://pastebin.com/cSdMbA2a

I don't see that region in the master logs for 4 days back.  I'm going to
import some new data from scratch soon and will be sure to keep all the
master logs.


On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792
>
> What I'd like to see is the log from the first time the master
> receives the split message. I guess it says the region doesn't exist
> anymore because the split was processed already in the master but
> there's a failure mode similar to 4792.
>
> I saw this on another cluster last week but the logs were rolling
> based on size so the original split message was lost.
>
> J-D
>
> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com> wrote:
> > I've had a few successful splits today without any errors.  I'll turn up
> > the importer speed tomorrow to start stressing the cluster more.
> >
> >
> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> Matt:
> >> Was there any region server crash before you noticed the repeated log
> for
> >> bc62a8a72124a4ba3f6b9f302587903c ?
> >>
> >> I wonder if the following JIRA might be related:
> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
> >>
> >> Thanks
> >>
> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> >>
> >> > Strangely, I don't see any record of that region in the master before
> >> what
> >> > I already pasted even though I have logs back to 10/30.  Next time it
> >> > happens I'll gather a full log record and try to debug while it's
> >> > occurring.
> >> >
> >> >
> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> >> > chrajeshbabu32@gmail.com> wrote:
> >> >
> >> > > Hi Matt,
> >> > > can you paste some more master logs of region
> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> >> > > I think Its not problem with splitting.
> >> > > We are getting
> >> > >       LOG.warn("Region " + encodedName + " not found on server " +
> >> > > serverName +
> >> > >         "; failed processing");
> >> > > this log means no entry in servers map(not fully assigned).
> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> >> > >     HRegionInfo foundHri = null;
> >> > >     for (HRegionInfo hri: hris) {
> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> >> > >         foundHri = hri;
> >> > >         break;
> >> > >       }
> >> > >     }
> >> > >     return foundHri;
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
> >> > wrote:
> >> > >
> >> > > > CC'ing dev list...
> >> > > >
> >> > > > Is anybody aware of any changes that went in recently that could
> >> cause
> >> > > > this?
> >> > > > I looked around a bit, but could not find anything obvious.
> >> > > >
> >> > > > -- Lars
> >> > > >
> >> > > >
> >> > > >
> >> > > > ________________________________
> >> > > >  From: Matt Corgan <mc...@hotpads.com>
> >> > > > To: user <us...@hbase.apache.org>
> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> > > >
> >> > > > I think the cluster is ok without running hbck, as restarting the
> >> > > > regionserver process stops the warnings and everything looks ok
> >> > > otherwise.
> >> > > >
> >> > > > here's the regionserver right after the split happens:
> >> > > > ------------------------
> >> > > > 2012-11-01 22:45:28,726 DEBUG
> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> node
> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> waiting
> >> on
> >> > > the
> >> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> >> > > > 2012-11-01 22:45:28,832 DEBUG
> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > > > 2012-11-01 22:45:28,837 DEBUG
> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> node
> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > > > -----------------------------
> >> > > >
> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> RS_ZK_REGION_SPLIT"
> >> > > > continues for 15 or so hours and finally settles without manual
> >> > > > intervention with these regionserver log messages:
> >> > > > -----------------------
> >> > > > 2012-11-02 13:55:00,906 DEBUG
> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> node *
> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> >> > > > RS_ZK_REGION_SPLIT
> >> > > >
> >> > > > 2012-11-02 13:55:00,916 DEBUG
> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> node *
> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> >> > > > RS_ZK_REGION_SPLIT
> >> > > >
> >> > > > 2012-11-02 13:55:00,916 INFO
> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split,
> META
> >> > > > updated, and report to master.
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> >> > > > Split took 15hrs, 9mins, 33sec
> >> > > >
> >> > > > 2012-11-02 13:55:00,945 INFO
> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split,
> META
> >> > > > updated, and report to master.
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> >> > > > new regions:
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> >> > > >
> >> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> >> > > > ----------------------
> >> > > >
> >> > > > And the master finally logs this:
> >> > > > -----------------------
> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
> >> > > >
> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
> for
> >> > > region
> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> >> > > anymore,
> >> > > > probably already processed its split
> >> > > > ------------------------
> >> > > >
> >> > > > I can't find any evidence that the region left the node during
> that
> >> > time.
> >> > > > I'll have to catch it in action next time and see what the region
> is
> >> up
> >> > > to
> >> > > > during the problem period.
> >> > > >
> >> > > > What does it mean that it successfully transitioned from SPLIT to
> >> > SPLIT?
> >> > > > Is that a valid transition?
> >> > > >
> >> > > > Matt
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com>
> wrote:
> >> > > >
> >> > > > > Matt:
> >> > > > > From the following we can see that region
> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > > > > cannot be found:
> >> > > > >
> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >> > > > > transition=RS_ZK_REGION_SPLIT,
> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> >> > > > > processing
> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
> >> for
> >> > > > region
> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't
> exist
> >> > > > anymore,
> >> > > > > probably already processed its split
> >> > > > >
> >> > > > > Have you run hbck to repair your cluster ?
> >> > > > >
> >> > > > > Thanks
> >> > > > >
> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> mcorgan@hotpads.com>
> >> > > wrote:
> >> > > > >
> >> > > > > > Here's a sample of the master's logs from yesterday.  It's not
> >> > > > correlated
> >> > > > > > exactly with the other pastebin log, but there's 3GB of this
> from
> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> >> > > > > >
> >> > > > > > I'm am pushing the cluster a bit with importing data so
> testing
> >> the
> >> > > > split
> >> > > > > > code harder than normal.  The regions are 500-1GB gzipped.  I
> can
> >> > > look
> >> > > > > into
> >> > > > > > it more but trying to figure out what to look for.
> >> > > > > >
> >> > > > > > Thanks Ted,
> >> > > > > > Matt
> >> > > > > >
> >> > > > > >
> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
> >> > wrote:
> >> > > > > >
> >> > > > > > > Matt:
> >> > > > > > > This is the method which made the logging:
> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> >> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
> >> > ServerName
> >> > > > > > > serverName,
> >> > > > > > >       final int znodeVersion)
> >> > > > > > >   throws KeeperException, IOException {
> >> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> >> > > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> >> EventType.RS_ZK_REGION_SPLIT,
> >> > > > > > >       znodeVersion, payload);
> >> > > > > > >   }
> >> > > > > > >
> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
> >> > master
> >> > > to
> >> > > > > > split
> >> > > > > > > the region. Obviously something caused the master not able
> to
> >> > > split.
> >> > > > > > >
> >> > > > > > > How large is the region ?
> >> > > > > > >
> >> > > > > > > Can you pastebin master log for that period of time ?
> >> > > > > > >
> >> > > > > > > Thanks
> >> > > > > > >
> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> >> mcorgan@hotpads.com
> >> > >
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have
> started to
> >> > > > > encounter
> >> > > > > > > > infinite loops of region-transition on splits.  I'm not
> sure
> >> > yet
> >> > > if
> >> > > > > > it's
> >> > > > > > > > all splits nor if it's related to load.  Solution so far
> has
> >> > been
> >> > > > to
> >> > > > > > > > restart the regionserver process.
> >> > > > > > > >
> >> > > > > > > > log snippet:
> >> > > > > > > > http://pastebin.com/LpienZ7B
> >> > > > > > > >
> >> > > > > > > > It's repeating these two lines:
> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
> transition
> >> > > node
> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT
> to
> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> >> transitioned
> >> > > node
> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT
> to
> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > > > > > > >
> >> > > > > > > > with the occasional:
> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> >> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> Still
> >> > > > waiting
> >> > > > > on
> >> > > > > > > the
> >> > > > > > > > master to process the split for
> >> > cf3e9bc069e1888983c06dc8e053ffcf
> >> > > > > > > >
> >> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to
> >> itself?
> >> > >  It
> >> > > > > > looks
> >> > > > > > > > wrong, but I'm not familiar with the code at all.
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Matt
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Jean-Daniel Cryans <jd...@apache.org>.

This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792

What I'd like to see is the log from the first time the master
receives the split message. I guess it says the region doesn't exist
anymore because the split was processed already in the master but
there's a failure mode similar to 4792.

I saw this on another cluster last week but the logs were rolling
based on size so the original split message was lost.

J-D

On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com> wrote:
> I've had a few successful splits today without any errors.  I'll turn up
> the importer speed tomorrow to start stressing the cluster more.
>
>
> On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Matt:
>> Was there any region server crash before you noticed the repeated log for
>> bc62a8a72124a4ba3f6b9f302587903c ?
>>
>> I wonder if the following JIRA might be related:
>> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
>>
>> Thanks
>>
>> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>
>> > Strangely, I don't see any record of that region in the master before
>> what
>> > I already pasted even though I have logs back to 10/30.  Next time it
>> > happens I'll gather a full log record and try to debug while it's
>> > occurring.
>> >
>> >
>> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
>> > chrajeshbabu32@gmail.com> wrote:
>> >
>> > > Hi Matt,
>> > > can you paste some more master logs of region
>> > > bc62a8a72124a4ba3f6b9f30258790 before split.
>> > > I think Its not problem with splitting.
>> > > We are getting
>> > >       LOG.warn("Region " + encodedName + " not found on server " +
>> > > serverName +
>> > >         "; failed processing");
>> > > this log means no entry in servers map(not fully assigned).
>> > >     Set<HRegionInfo> hris = this.servers.get(sn);
>> > >     HRegionInfo foundHri = null;
>> > >     for (HRegionInfo hri: hris) {
>> > >       if (hri.getEncodedName().equals(encodedName)) {
>> > >         foundHri = hri;
>> > >         break;
>> > >       }
>> > >     }
>> > >     return foundHri;
>> > >
>> > >
>> > >
>> > >
>> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
>> > wrote:
>> > >
>> > > > CC'ing dev list...
>> > > >
>> > > > Is anybody aware of any changes that went in recently that could
>> cause
>> > > > this?
>> > > > I looked around a bit, but could not find anything obvious.
>> > > >
>> > > > -- Lars
>> > > >
>> > > >
>> > > >
>> > > > ________________________________
>> > > >  From: Matt Corgan <mc...@hotpads.com>
>> > > > To: user <us...@hbase.apache.org>
>> > > > Sent: Saturday, November 3, 2012 5:27 PM
>> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > > >
>> > > > I think the cluster is ok without running hbck, as restarting the
>> > > > regionserver process stops the warnings and everything looks ok
>> > > otherwise.
>> > > >
>> > > > here's the regionserver right after the split happens:
>> > > > ------------------------
>> > > > 2012-11-01 22:45:28,726 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
>> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
>> > > > 2012-11-01 22:45:28,730 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
>> > > > 2012-11-01 22:45:28,730 DEBUG
>> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting
>> on
>> > > the
>> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
>> > > > 2012-11-01 22:45:28,832 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
>> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
>> > > > 2012-11-01 22:45:28,837 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
>> > > > -----------------------------
>> > > >
>> > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
>> > > > continues for 15 or so hours and finally settles without manual
>> > > > intervention with these regionserver log messages:
>> > > > -----------------------
>> > > > 2012-11-02 13:55:00,906 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node *
>> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
>> > > > RS_ZK_REGION_SPLIT
>> > > >
>> > > > 2012-11-02 13:55:00,916 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node *
>> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
>> > > > RS_ZK_REGION_SPLIT
>> > > >
>> > > > 2012-11-02 13:55:00,916 INFO
>> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
>> > > > updated, and report to master.
>> > > >
>> > > >
>> > >
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
>> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>> > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>> > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
>> > > > Split took 15hrs, 9mins, 33sec
>> > > >
>> > > > 2012-11-02 13:55:00,945 INFO
>> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
>> > > > updated, and report to master.
>> > > >
>> > > >
>> > >
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
>> > > > new regions:
>> > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
>> > > >
>> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
>> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
>> > > > ----------------------
>> > > >
>> > > > And the master finally logs this:
>> > > > -----------------------
>> > > > 2012-11-02 13:55:00,783 WARN
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
>> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
>> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
>> > > >
>> > > > 2012-11-02 13:55:00,783 WARN
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
>> > > region
>> > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
>> > > anymore,
>> > > > probably already processed its split
>> > > > ------------------------
>> > > >
>> > > > I can't find any evidence that the region left the node during that
>> > time.
>> > > > I'll have to catch it in action next time and see what the region is
>> up
>> > > to
>> > > > during the problem period.
>> > > >
>> > > > What does it mean that it successfully transitioned from SPLIT to
>> > SPLIT?
>> > > > Is that a valid transition?
>> > > >
>> > > > Matt
>> > > >
>> > > >
>> > > >
>> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
>> > > >
>> > > > > Matt:
>> > > > > From the following we can see that region
>> > > > bc62a8a72124a4ba3f6b9f302587903c
>> > > > > cannot be found:
>> > > > >
>> > > > > 2012-11-02 00:00:02,909 DEBUG
>> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
>> > > > > transition=RS_ZK_REGION_SPLIT,
>> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
>> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
>> > > > > 2012-11-02 00:00:02,909 WARN
>> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
>> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
>> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
>> > > > > processing
>> > > > > 2012-11-02 00:00:02,909 WARN
>> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
>> for
>> > > > region
>> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
>> > > > anymore,
>> > > > > probably already processed its split
>> > > > >
>> > > > > Have you run hbck to repair your cluster ?
>> > > > >
>> > > > > Thanks
>> > > > >
>> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com>
>> > > wrote:
>> > > > >
>> > > > > > Here's a sample of the master's logs from yesterday.  It's not
>> > > > correlated
>> > > > > > exactly with the other pastebin log, but there's 3GB of this from
>> > > > > > yesterday: http://pastebin.com/wP2rNN1t
>> > > > > >
>> > > > > > I'm am pushing the cluster a bit with importing data so testing
>> the
>> > > > split
>> > > > > > code harder than normal.  The regions are 500-1GB gzipped.  I can
>> > > look
>> > > > > into
>> > > > > > it more but trying to figure out what to look for.
>> > > > > >
>> > > > > > Thanks Ted,
>> > > > > > Matt
>> > > > > >
>> > > > > >
>> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
>> > wrote:
>> > > > > >
>> > > > > > > Matt:
>> > > > > > > This is the method which made the logging:
>> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
>> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
>> > ServerName
>> > > > > > > serverName,
>> > > > > > >       final int znodeVersion)
>> > > > > > >   throws KeeperException, IOException {
>> > > > > > >     byte [] payload = Writables.getBytes(a, b);
>> > > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
>> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
>> EventType.RS_ZK_REGION_SPLIT,
>> > > > > > >       znodeVersion, payload);
>> > > > > > >   }
>> > > > > > >
>> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
>> > master
>> > > to
>> > > > > > split
>> > > > > > > the region. Obviously something caused the master not able to
>> > > split.
>> > > > > > >
>> > > > > > > How large is the region ?
>> > > > > > >
>> > > > > > > Can you pastebin master log for that period of time ?
>> > > > > > >
>> > > > > > > Thanks
>> > > > > > >
>> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
>> mcorgan@hotpads.com
>> > >
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > We upgraded from .94.0 to .94.2 last week and have started to
>> > > > > encounter
>> > > > > > > > infinite loops of region-transition on splits.  I'm not sure
>> > yet
>> > > if
>> > > > > > it's
>> > > > > > > > all splits nor if it's related to load.  Solution so far has
>> > been
>> > > > to
>> > > > > > > > restart the regionserver process.
>> > > > > > > >
>> > > > > > > > log snippet:
>> > > > > > > > http://pastebin.com/LpienZ7B
>> > > > > > > >
>> > > > > > > > It's repeating these two lines:
>> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
>> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition
>> > > node
>> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
>> > > > > > > > RS_ZK_REGION_SPLIT
>> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
>> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
>> transitioned
>> > > node
>> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
>> > > > > > > > RS_ZK_REGION_SPLIT
>> > > > > > > >
>> > > > > > > > with the occasional:
>> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
>> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
>> > > > waiting
>> > > > > on
>> > > > > > > the
>> > > > > > > > master to process the split for
>> > cf3e9bc069e1888983c06dc8e053ffcf
>> > > > > > > >
>> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to
>> itself?
>> > >  It
>> > > > > > looks
>> > > > > > > > wrong, but I'm not familiar with the code at all.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Matt
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

Forgot to answer Ted's question - I don't remember exactly what we did to
the cluster.  I don't think there were any regionserver crashes, but there
were probably controlled restarts, new servers added to the running
cluster, and some regions moved by our manual balancer job.


On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mc...@hotpads.com> wrote:

> I've had a few successful splits today without any errors.  I'll turn up
> the importer speed tomorrow to start stressing the cluster more.
>
>
>
> On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Matt:
>> Was there any region server crash before you noticed the repeated log for
>> bc62a8a72124a4ba3f6b9f302587903c ?
>>
>> I wonder if the following JIRA might be related:
>> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
>>
>> Thanks
>>
>> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>
>> > Strangely, I don't see any record of that region in the master before
>> what
>> > I already pasted even though I have logs back to 10/30.  Next time it
>> > happens I'll gather a full log record and try to debug while it's
>> > occurring.
>> >
>> >
>> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
>> > chrajeshbabu32@gmail.com> wrote:
>> >
>> > > Hi Matt,
>> > > can you paste some more master logs of region
>> > > bc62a8a72124a4ba3f6b9f30258790 before split.
>> > > I think Its not problem with splitting.
>> > > We are getting
>> > >       LOG.warn("Region " + encodedName + " not found on server " +
>> > > serverName +
>> > >         "; failed processing");
>> > > this log means no entry in servers map(not fully assigned).
>> > >     Set<HRegionInfo> hris = this.servers.get(sn);
>> > >     HRegionInfo foundHri = null;
>> > >     for (HRegionInfo hri: hris) {
>> > >       if (hri.getEncodedName().equals(encodedName)) {
>> > >         foundHri = hri;
>> > >         break;
>> > >       }
>> > >     }
>> > >     return foundHri;
>> > >
>> > >
>> > >
>> > >
>> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
>> > wrote:
>> > >
>> > > > CC'ing dev list...
>> > > >
>> > > > Is anybody aware of any changes that went in recently that could
>> cause
>> > > > this?
>> > > > I looked around a bit, but could not find anything obvious.
>> > > >
>> > > > -- Lars
>> > > >
>> > > >
>> > > >
>> > > > ________________________________
>> > > >  From: Matt Corgan <mc...@hotpads.com>
>> > > > To: user <us...@hbase.apache.org>
>> > > > Sent: Saturday, November 3, 2012 5:27 PM
>> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>> > > >
>> > > > I think the cluster is ok without running hbck, as restarting the
>> > > > regionserver process stops the warnings and everything looks ok
>> > > otherwise.
>> > > >
>> > > > here's the regionserver right after the split happens:
>> > > > ------------------------
>> > > > 2012-11-01 22:45:28,726 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
>> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
>> > > > 2012-11-01 22:45:28,730 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
>> > > > 2012-11-01 22:45:28,730 DEBUG
>> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
>> waiting on
>> > > the
>> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
>> > > > 2012-11-01 22:45:28,832 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
>> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
>> > > > 2012-11-01 22:45:28,837 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
>> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
>> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
>> > > > -----------------------------
>> > > >
>> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
>> RS_ZK_REGION_SPLIT"
>> > > > continues for 15 or so hours and finally settles without manual
>> > > > intervention with these regionserver log messages:
>> > > > -----------------------
>> > > > 2012-11-02 13:55:00,906 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node *
>> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
>> > > > RS_ZK_REGION_SPLIT
>> > > >
>> > > > 2012-11-02 13:55:00,916 DEBUG
>> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
>> *
>> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
>> > > > RS_ZK_REGION_SPLIT
>> > > >
>> > > > 2012-11-02 13:55:00,916 INFO
>> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split,
>> META
>> > > > updated, and report to master.
>> > > >
>> > > >
>> > >
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
>> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>> > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>> > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
>> > > > Split took 15hrs, 9mins, 33sec
>> > > >
>> > > > 2012-11-02 13:55:00,945 INFO
>> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split,
>> META
>> > > > updated, and report to master.
>> > > >
>> > > >
>> > >
>> >
>> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
>> > > > new regions:
>> > > >
>> > > >
>> > >
>> >
>> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
>> > > >
>> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
>> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
>> > > > ----------------------
>> > > >
>> > > > And the master finally logs this:
>> > > > -----------------------
>> > > > 2012-11-02 13:55:00,783 WARN
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
>> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
>> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
>> > > >
>> > > > 2012-11-02 13:55:00,783 WARN
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
>> > > region
>> > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
>> > > anymore,
>> > > > probably already processed its split
>> > > > ------------------------
>> > > >
>> > > > I can't find any evidence that the region left the node during that
>> > time.
>> > > > I'll have to catch it in action next time and see what the region
>> is up
>> > > to
>> > > > during the problem period.
>> > > >
>> > > > What does it mean that it successfully transitioned from SPLIT to
>> > SPLIT?
>> > > > Is that a valid transition?
>> > > >
>> > > > Matt
>> > > >
>> > > >
>> > > >
>> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
>> > > >
>> > > > > Matt:
>> > > > > From the following we can see that region
>> > > > bc62a8a72124a4ba3f6b9f302587903c
>> > > > > cannot be found:
>> > > > >
>> > > > > 2012-11-02 00:00:02,909 DEBUG
>> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
>> > > > > transition=RS_ZK_REGION_SPLIT,
>> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
>> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
>> > > > > 2012-11-02 00:00:02,909 WARN
>> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
>> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
>> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
>> > > > > processing
>> > > > > 2012-11-02 00:00:02,909 WARN
>> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
>> for
>> > > > region
>> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
>> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
>> > > > anymore,
>> > > > > probably already processed its split
>> > > > >
>> > > > > Have you run hbck to repair your cluster ?
>> > > > >
>> > > > > Thanks
>> > > > >
>> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com>
>> > > wrote:
>> > > > >
>> > > > > > Here's a sample of the master's logs from yesterday.  It's not
>> > > > correlated
>> > > > > > exactly with the other pastebin log, but there's 3GB of this
>> from
>> > > > > > yesterday: http://pastebin.com/wP2rNN1t
>> > > > > >
>> > > > > > I'm am pushing the cluster a bit with importing data so testing
>> the
>> > > > split
>> > > > > > code harder than normal.  The regions are 500-1GB gzipped.  I
>> can
>> > > look
>> > > > > into
>> > > > > > it more but trying to figure out what to look for.
>> > > > > >
>> > > > > > Thanks Ted,
>> > > > > > Matt
>> > > > > >
>> > > > > >
>> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
>> > wrote:
>> > > > > >
>> > > > > > > Matt:
>> > > > > > > This is the method which made the logging:
>> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
>> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
>> > ServerName
>> > > > > > > serverName,
>> > > > > > >       final int znodeVersion)
>> > > > > > >   throws KeeperException, IOException {
>> > > > > > >     byte [] payload = Writables.getBytes(a, b);
>> > > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
>> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
>> EventType.RS_ZK_REGION_SPLIT,
>> > > > > > >       znodeVersion, payload);
>> > > > > > >   }
>> > > > > > >
>> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
>> > master
>> > > to
>> > > > > > split
>> > > > > > > the region. Obviously something caused the master not able to
>> > > split.
>> > > > > > >
>> > > > > > > How large is the region ?
>> > > > > > >
>> > > > > > > Can you pastebin master log for that period of time ?
>> > > > > > >
>> > > > > > > Thanks
>> > > > > > >
>> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
>> mcorgan@hotpads.com
>> > >
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > We upgraded from .94.0 to .94.2 last week and have started
>> to
>> > > > > encounter
>> > > > > > > > infinite loops of region-transition on splits.  I'm not sure
>> > yet
>> > > if
>> > > > > > it's
>> > > > > > > > all splits nor if it's related to load.  Solution so far has
>> > been
>> > > > to
>> > > > > > > > restart the regionserver process.
>> > > > > > > >
>> > > > > > > > log snippet:
>> > > > > > > > http://pastebin.com/LpienZ7B
>> > > > > > > >
>> > > > > > > > It's repeating these two lines:
>> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
>> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to
>> transition
>> > > node
>> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
>> > > > > > > > RS_ZK_REGION_SPLIT
>> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
>> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
>> transitioned
>> > > node
>> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
>> > > > > > > > RS_ZK_REGION_SPLIT
>> > > > > > > >
>> > > > > > > > with the occasional:
>> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
>> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
>> > > > waiting
>> > > > > on
>> > > > > > > the
>> > > > > > > > master to process the split for
>> > cf3e9bc069e1888983c06dc8e053ffcf
>> > > > > > > >
>> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to
>> itself?
>> > >  It
>> > > > > > looks
>> > > > > > > > wrong, but I'm not familiar with the code at all.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Matt
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

I've had a few successful splits today without any errors.  I'll turn up
the importer speed tomorrow to start stressing the cluster more.


On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yu...@gmail.com> wrote:

> Matt:
> Was there any region server crash before you noticed the repeated log for
> bc62a8a72124a4ba3f6b9f302587903c ?
>
> I wonder if the following JIRA might be related:
> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
>
> Thanks
>
> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > Strangely, I don't see any record of that region in the master before
> what
> > I already pasted even though I have logs back to 10/30.  Next time it
> > happens I'll gather a full log record and try to debug while it's
> > occurring.
> >
> >
> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> > chrajeshbabu32@gmail.com> wrote:
> >
> > > Hi Matt,
> > > can you paste some more master logs of region
> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > > I think Its not problem with splitting.
> > > We are getting
> > >       LOG.warn("Region " + encodedName + " not found on server " +
> > > serverName +
> > >         "; failed processing");
> > > this log means no entry in servers map(not fully assigned).
> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > >     HRegionInfo foundHri = null;
> > >     for (HRegionInfo hri: hris) {
> > >       if (hri.getEncodedName().equals(encodedName)) {
> > >         foundHri = hri;
> > >         break;
> > >       }
> > >     }
> > >     return foundHri;
> > >
> > >
> > >
> > >
> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> > >
> > > > CC'ing dev list...
> > > >
> > > > Is anybody aware of any changes that went in recently that could
> cause
> > > > this?
> > > > I looked around a bit, but could not find anything obvious.
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >  From: Matt Corgan <mc...@hotpads.com>
> > > > To: user <us...@hbase.apache.org>
> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > > >
> > > > I think the cluster is ok without running hbck, as restarting the
> > > > regionserver process stops the warnings and everything looks ok
> > > otherwise.
> > > >
> > > > here's the regionserver right after the split happens:
> > > > ------------------------
> > > > 2012-11-01 22:45:28,726 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > > 2012-11-01 22:45:28,730 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > > 2012-11-01 22:45:28,730 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting
> on
> > > the
> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> > > > 2012-11-01 22:45:28,832 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > 2012-11-01 22:45:28,837 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > > -----------------------------
> > > >
> > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
> > > > continues for 15 or so hours and finally settles without manual
> > > > intervention with these regionserver log messages:
> > > > -----------------------
> > > > 2012-11-02 13:55:00,906 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node *
> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > > > RS_ZK_REGION_SPLIT
> > > >
> > > > 2012-11-02 13:55:00,916 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node *
> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > > > RS_ZK_REGION_SPLIT
> > > >
> > > > 2012-11-02 13:55:00,916 INFO
> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > > > updated, and report to master.
> > > >
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > > Split took 15hrs, 9mins, 33sec
> > > >
> > > > 2012-11-02 13:55:00,945 INFO
> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > > > updated, and report to master.
> > > >
> > > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > > new regions:
> > > >
> > > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > > >
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> > > > ----------------------
> > > >
> > > > And the master finally logs this:
> > > > -----------------------
> > > > 2012-11-02 13:55:00,783 WARN
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
> > > >
> > > > 2012-11-02 13:55:00,783 WARN
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > > region
> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > > anymore,
> > > > probably already processed its split
> > > > ------------------------
> > > >
> > > > I can't find any evidence that the region left the node during that
> > time.
> > > > I'll have to catch it in action next time and see what the region is
> up
> > > to
> > > > during the problem period.
> > > >
> > > > What does it mean that it successfully transitioned from SPLIT to
> > SPLIT?
> > > > Is that a valid transition?
> > > >
> > > > Matt
> > > >
> > > >
> > > >
> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > > Matt:
> > > > > From the following we can see that region
> > > > bc62a8a72124a4ba3f6b9f302587903c
> > > > > cannot be found:
> > > > >
> > > > > 2012-11-02 00:00:02,909 DEBUG
> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > > > > transition=RS_ZK_REGION_SPLIT,
> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > > > 2012-11-02 00:00:02,909 WARN
> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > > > processing
> > > > > 2012-11-02 00:00:02,909 WARN
> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT
> for
> > > > region
> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > > > anymore,
> > > > > probably already processed its split
> > > > >
> > > > > Have you run hbck to repair your cluster ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com>
> > > wrote:
> > > > >
> > > > > > Here's a sample of the master's logs from yesterday.  It's not
> > > > correlated
> > > > > > exactly with the other pastebin log, but there's 3GB of this from
> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > > > >
> > > > > > I'm am pushing the cluster a bit with importing data so testing
> the
> > > > split
> > > > > > code harder than normal.  The regions are 500-1GB gzipped.  I can
> > > look
> > > > > into
> > > > > > it more but trying to figure out what to look for.
> > > > > >
> > > > > > Thanks Ted,
> > > > > > Matt
> > > > > >
> > > > > >
> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Matt:
> > > > > > > This is the method which made the logging:
> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
> > ServerName
> > > > > > > serverName,
> > > > > > >       final int znodeVersion)
> > > > > > >   throws KeeperException, IOException {
> > > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> EventType.RS_ZK_REGION_SPLIT,
> > > > > > >       znodeVersion, payload);
> > > > > > >   }
> > > > > > >
> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
> > master
> > > to
> > > > > > split
> > > > > > > the region. Obviously something caused the master not able to
> > > split.
> > > > > > >
> > > > > > > How large is the region ?
> > > > > > >
> > > > > > > Can you pastebin master log for that period of time ?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <
> mcorgan@hotpads.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > We upgraded from .94.0 to .94.2 last week and have started to
> > > > > encounter
> > > > > > > > infinite loops of region-transition on splits.  I'm not sure
> > yet
> > > if
> > > > > > it's
> > > > > > > > all splits nor if it's related to load.  Solution so far has
> > been
> > > > to
> > > > > > > > restart the regionserver process.
> > > > > > > >
> > > > > > > > log snippet:
> > > > > > > > http://pastebin.com/LpienZ7B
> > > > > > > >
> > > > > > > > It's repeating these two lines:
> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition
> > > node
> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > > > RS_ZK_REGION_SPLIT
> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully
> transitioned
> > > node
> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > > > RS_ZK_REGION_SPLIT
> > > > > > > >
> > > > > > > > with the occasional:
> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> > > > waiting
> > > > > on
> > > > > > > the
> > > > > > > > master to process the split for
> > cf3e9bc069e1888983c06dc8e053ffcf
> > > > > > > >
> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to
> itself?
> > >  It
> > > > > > looks
> > > > > > > > wrong, but I'm not familiar with the code at all.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Matt
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Ted Yu <yu...@gmail.com>.

Matt:
Was there any region server crash before you noticed the repeated log for
bc62a8a72124a4ba3f6b9f302587903c ?

I wonder if the following JIRA might be related:
HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter

Thanks

On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mc...@hotpads.com> wrote:

> Strangely, I don't see any record of that region in the master before what
> I already pasted even though I have logs back to 10/30.  Next time it
> happens I'll gather a full log record and try to debug while it's
> occurring.
>
>
> On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> chrajeshbabu32@gmail.com> wrote:
>
> > Hi Matt,
> > can you paste some more master logs of region
> > bc62a8a72124a4ba3f6b9f30258790 before split.
> > I think Its not problem with splitting.
> > We are getting
> >       LOG.warn("Region " + encodedName + " not found on server " +
> > serverName +
> >         "; failed processing");
> > this log means no entry in servers map(not fully assigned).
> >     Set<HRegionInfo> hris = this.servers.get(sn);
> >     HRegionInfo foundHri = null;
> >     for (HRegionInfo hri: hris) {
> >       if (hri.getEncodedName().equals(encodedName)) {
> >         foundHri = hri;
> >         break;
> >       }
> >     }
> >     return foundHri;
> >
> >
> >
> >
> > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
> >
> > > CC'ing dev list...
> > >
> > > Is anybody aware of any changes that went in recently that could cause
> > > this?
> > > I looked around a bit, but could not find anything obvious.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: Matt Corgan <mc...@hotpads.com>
> > > To: user <us...@hbase.apache.org>
> > > Sent: Saturday, November 3, 2012 5:27 PM
> > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >
> > > I think the cluster is ok without running hbck, as restarting the
> > > regionserver process stops the warnings and everything looks ok
> > otherwise.
> > >
> > > here's the regionserver right after the split happens:
> > > ------------------------
> > > 2012-11-01 22:45:28,726 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > > 2012-11-01 22:45:28,730 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > > 2012-11-01 22:45:28,730 DEBUG
> > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on
> > the
> > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> > > 2012-11-01 22:45:28,832 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > > 2012-11-01 22:45:28,837 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > -----------------------------
> > >
> > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
> > > continues for 15 or so hours and finally settles without manual
> > > intervention with these regionserver log messages:
> > > -----------------------
> > > 2012-11-02 13:55:00,906 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Attempting to transition node *
> > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > > RS_ZK_REGION_SPLIT
> > >
> > > 2012-11-02 13:55:00,916 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > regionserver:60020-0x13ab46479832953 Successfully transitioned node *
> > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > > RS_ZK_REGION_SPLIT
> > >
> > > 2012-11-02 13:55:00,916 INFO
> > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > > updated, and report to master.
> > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > > Split took 15hrs, 9mins, 33sec
> > >
> > > 2012-11-02 13:55:00,945 INFO
> > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > > updated, and report to master.
> > >
> > >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > > new regions:
> > >
> > >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> > > ----------------------
> > >
> > > And the master finally logs this:
> > > -----------------------
> > > 2012-11-02 13:55:00,783 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
> > >
> > > 2012-11-02 13:55:00,783 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > region
> > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > anymore,
> > > probably already processed its split
> > > ------------------------
> > >
> > > I can't find any evidence that the region left the node during that
> time.
> > > I'll have to catch it in action next time and see what the region is up
> > to
> > > during the problem period.
> > >
> > > What does it mean that it successfully transitioned from SPLIT to
> SPLIT?
> > > Is that a valid transition?
> > >
> > > Matt
> > >
> > >
> > >
> > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Matt:
> > > > From the following we can see that region
> > > bc62a8a72124a4ba3f6b9f302587903c
> > > > cannot be found:
> > > >
> > > > 2012-11-02 00:00:02,909 DEBUG
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > > > transition=RS_ZK_REGION_SPLIT,
> > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > > 2012-11-02 00:00:02,909 WARN
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > > processing
> > > > 2012-11-02 00:00:02,909 WARN
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > > region
> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > > anymore,
> > > > probably already processed its split
> > > >
> > > > Have you run hbck to repair your cluster ?
> > > >
> > > > Thanks
> > > >
> > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > >
> > > > > Here's a sample of the master's logs from yesterday.  It's not
> > > correlated
> > > > > exactly with the other pastebin log, but there's 3GB of this from
> > > > > yesterday: http://pastebin.com/wP2rNN1t
> > > > >
> > > > > I'm am pushing the cluster a bit with importing data so testing the
> > > split
> > > > > code harder than normal.  The regions are 500-1GB gzipped.  I can
> > look
> > > > into
> > > > > it more but trying to figure out what to look for.
> > > > >
> > > > > Thanks Ted,
> > > > > Matt
> > > > >
> > > > >
> > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > > > Matt:
> > > > > > This is the method which made the logging:
> > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b,
> ServerName
> > > > > > serverName,
> > > > > >       final int znodeVersion)
> > > > > >   throws KeeperException, IOException {
> > > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > > > > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > > > > >       znodeVersion, payload);
> > > > > >   }
> > > > > >
> > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for
> master
> > to
> > > > > split
> > > > > > the region. Obviously something caused the master not able to
> > split.
> > > > > >
> > > > > > How large is the region ?
> > > > > >
> > > > > > Can you pastebin master log for that period of time ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <mcorgan@hotpads.com
> >
> > > > wrote:
> > > > > >
> > > > > > > We upgraded from .94.0 to .94.2 last week and have started to
> > > > encounter
> > > > > > > infinite loops of region-transition on splits.  I'm not sure
> yet
> > if
> > > > > it's
> > > > > > > all splits nor if it's related to load.  Solution so far has
> been
> > > to
> > > > > > > restart the regionserver process.
> > > > > > >
> > > > > > > log snippet:
> > > > > > > http://pastebin.com/LpienZ7B
> > > > > > >
> > > > > > > It's repeating these two lines:
> > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition
> > node
> > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > > RS_ZK_REGION_SPLIT
> > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned
> > node
> > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > > RS_ZK_REGION_SPLIT
> > > > > > >
> > > > > > > with the occasional:
> > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> > > waiting
> > > > on
> > > > > > the
> > > > > > > master to process the split for
> cf3e9bc069e1888983c06dc8e053ffcf
> > > > > > >
> > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?
> >  It
> > > > > looks
> > > > > > > wrong, but I'm not familiar with the code at all.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Matt
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by Matt Corgan <mc...@hotpads.com>.

Strangely, I don't see any record of that region in the master before what
I already pasted even though I have logs back to 10/30.  Next time it
happens I'll gather a full log record and try to debug while it's occurring.


On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
chrajeshbabu32@gmail.com> wrote:

> Hi Matt,
> can you paste some more master logs of region
> bc62a8a72124a4ba3f6b9f30258790 before split.
> I think Its not problem with splitting.
> We are getting
>       LOG.warn("Region " + encodedName + " not found on server " +
> serverName +
>         "; failed processing");
> this log means no entry in servers map(not fully assigned).
>     Set<HRegionInfo> hris = this.servers.get(sn);
>     HRegionInfo foundHri = null;
>     for (HRegionInfo hri: hris) {
>       if (hri.getEncodedName().equals(encodedName)) {
>         foundHri = hri;
>         break;
>       }
>     }
>     return foundHri;
>
>
>
>
> On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com> wrote:
>
> > CC'ing dev list...
> >
> > Is anybody aware of any changes that went in recently that could cause
> > this?
> > I looked around a bit, but could not find anything obvious.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Matt Corgan <mc...@hotpads.com>
> > To: user <us...@hbase.apache.org>
> > Sent: Saturday, November 3, 2012 5:27 PM
> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >
> > I think the cluster is ok without running hbck, as restarting the
> > regionserver process stops the warnings and everything looks ok
> otherwise.
> >
> > here's the regionserver right after the split happens:
> > ------------------------
> > 2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > 2012-11-01 22:45:28,730 DEBUG
> > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on
> the
> > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> > 2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > 2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > -----------------------------
> >
> > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
> > continues for 15 or so hours and finally settles without manual
> > intervention with these regionserver log messages:
> > -----------------------
> > 2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Attempting to transition node *
> > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > RS_ZK_REGION_SPLIT
> >
> > 2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Successfully transitioned node *
> > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > RS_ZK_REGION_SPLIT
> >
> > 2012-11-02 13:55:00,916 INFO
> > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > updated, and report to master.
> >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > Split took 15hrs, 9mins, 33sec
> >
> > 2012-11-02 13:55:00,945 INFO
> > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > updated, and report to master.
> >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > new regions:
> >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> > ----------------------
> >
> > And the master finally logs this:
> > -----------------------
> > 2012-11-02 13:55:00,783 WARN
> > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
> >
> > 2012-11-02 13:55:00,783 WARN
> > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> region
> > bc62a8a72124a4ba3f6b9f302587903c from server
> > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> anymore,
> > probably already processed its split
> > ------------------------
> >
> > I can't find any evidence that the region left the node during that time.
> > I'll have to catch it in action next time and see what the region is up
> to
> > during the problem period.
> >
> > What does it mean that it successfully transitioned from SPLIT to SPLIT?
> > Is that a valid transition?
> >
> > Matt
> >
> >
> >
> > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Matt:
> > > From the following we can see that region
> > bc62a8a72124a4ba3f6b9f302587903c
> > > cannot be found:
> > >
> > > 2012-11-02 00:00:02,909 DEBUG
> > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > > transition=RS_ZK_REGION_SPLIT,
> > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > 2012-11-02 00:00:02,909 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > processing
> > > 2012-11-02 00:00:02,909 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > region
> > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > anymore,
> > > probably already processed its split
> > >
> > > Have you run hbck to repair your cluster ?
> > >
> > > Thanks
> > >
> > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > >
> > > > Here's a sample of the master's logs from yesterday.  It's not
> > correlated
> > > > exactly with the other pastebin log, but there's 3GB of this from
> > > > yesterday: http://pastebin.com/wP2rNN1t
> > > >
> > > > I'm am pushing the cluster a bit with importing data so testing the
> > split
> > > > code harder than normal.  The regions are 500-1GB gzipped.  I can
> look
> > > into
> > > > it more but trying to figure out what to look for.
> > > >
> > > > Thanks Ted,
> > > > Matt
> > > >
> > > >
> > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > > Matt:
> > > > > This is the method which made the logging:
> > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName
> > > > > serverName,
> > > > >       final int znodeVersion)
> > > > >   throws KeeperException, IOException {
> > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > > > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > > > >       znodeVersion, payload);
> > > > >   }
> > > > >
> > > > > transitionZKNode() calls tickleNodeSplit() when waiting for master
> to
> > > > split
> > > > > the region. Obviously something caused the master not able to
> split.
> > > > >
> > > > > How large is the region ?
> > > > >
> > > > > Can you pastebin master log for that period of time ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <mc...@hotpads.com>
> > > wrote:
> > > > >
> > > > > > We upgraded from .94.0 to .94.2 last week and have started to
> > > encounter
> > > > > > infinite loops of region-transition on splits.  I'm not sure yet
> if
> > > > it's
> > > > > > all splits nor if it's related to load.  Solution so far has been
> > to
> > > > > > restart the regionserver process.
> > > > > >
> > > > > > log snippet:
> > > > > > http://pastebin.com/LpienZ7B
> > > > > >
> > > > > > It's repeating these two lines:
> > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition
> node
> > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > RS_ZK_REGION_SPLIT
> > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned
> node
> > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > RS_ZK_REGION_SPLIT
> > > > > >
> > > > > > with the occasional:
> > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> > waiting
> > > on
> > > > > the
> > > > > > master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> > > > > >
> > > > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?
>  It
> > > > looks
> > > > > > wrong, but I'm not familiar with the code at all.
> > > > > >
> > > > > > Thanks,
> > > > > > Matt
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2

Posted by rajesh babu chintaguntla <ch...@gmail.com>.

Hi Matt,
can you paste some more master logs of region
bc62a8a72124a4ba3f6b9f30258790 before split.
I think Its not problem with splitting.
We are getting
      LOG.warn("Region " + encodedName + " not found on server " +
serverName +
        "; failed processing");
this log means no entry in servers map(not fully assigned).
    Set<HRegionInfo> hris = this.servers.get(sn);
    HRegionInfo foundHri = null;
    for (HRegionInfo hri: hris) {
      if (hri.getEncodedName().equals(encodedName)) {
        foundHri = hri;
        break;
      }
    }
    return foundHri;




On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <lh...@yahoo.com> wrote:

> CC'ing dev list...
>
> Is anybody aware of any changes that went in recently that could cause
> this?
> I looked around a bit, but could not find anything obvious.
>
> -- Lars
>
>
>
> ________________________________
>  From: Matt Corgan <mc...@hotpads.com>
> To: user <us...@hbase.apache.org>
> Sent: Saturday, November 3, 2012 5:27 PM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> I think the cluster is ok without running hbck, as restarting the
> regionserver process stops the warnings and everything looks ok otherwise.
>
> here's the regionserver right after the split happens:
> ------------------------
> 2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13ab46479832953 Attempting to transition node
> bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13ab46479832953 Successfully transitioned node
> bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> REGION_SPLITTING to RS_ZK_REGION_SPLIT
> 2012-11-01 22:45:28,730 DEBUG
> org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the
> master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> 2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13ab46479832953 Attempting to transition node
> bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> EGION_SPLIT to RS_ZK_REGION_SPLIT
> 2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13ab46479832953 Successfully transitioned node
> bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> REGION_SPLIT to RS_ZK_REGION_SPLIT
> -----------------------------
>
> The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
> continues for 15 or so hours and finally settles without manual
> intervention with these regionserver log messages:
> -----------------------
> 2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13ab46479832953 Attempting to transition node *
> bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> RS_ZK_REGION_SPLIT
>
> 2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13ab46479832953 Successfully transitioned node *
> bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> RS_ZK_REGION_SPLIT
>
> 2012-11-02 13:55:00,916 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> updated, and report to master.
>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> Split took 15hrs, 9mins, 33sec
>
> 2012-11-02 13:55:00,945 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> updated, and report to master.
>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> new regions:
>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> ----------------------
>
> And the master finally logs this:
> -----------------------
> 2012-11-02 13:55:00,783 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Region
> bc62a8a72124a4ba3f6b9f302587903c not found on server
> HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
>
> 2012-11-02 13:55:00,783 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region
> bc62a8a72124a4ba3f6b9f302587903c from server
> HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore,
> probably already processed its split
> ------------------------
>
> I can't find any evidence that the region left the node during that time.
> I'll have to catch it in action next time and see what the region is up to
> during the problem period.
>
> What does it mean that it successfully transitioned from SPLIT to SPLIT?
> Is that a valid transition?
>
> Matt
>
>
>
> On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Matt:
> > From the following we can see that region
> bc62a8a72124a4ba3f6b9f302587903c
> > cannot be found:
> >
> > 2012-11-02 00:00:02,909 DEBUG
> > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > transition=RS_ZK_REGION_SPLIT,
> > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > region=bc62a8a72124a4ba3f6b9f302587903c
> > 2012-11-02 00:00:02,909 WARN
> > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > processing
> > 2012-11-02 00:00:02,909 WARN
> > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> region
> > bc62a8a72124a4ba3f6b9f302587903c from server
> > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> anymore,
> > probably already processed its split
> >
> > Have you run hbck to repair your cluster ?
> >
> > Thanks
> >
> > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >
> > > Here's a sample of the master's logs from yesterday.  It's not
> correlated
> > > exactly with the other pastebin log, but there's 3GB of this from
> > > yesterday: http://pastebin.com/wP2rNN1t
> > >
> > > I'm am pushing the cluster a bit with importing data so testing the
> split
> > > code harder than normal.  The regions are 500-1GB gzipped.  I can look
> > into
> > > it more but trying to figure out what to look for.
> > >
> > > Thanks Ted,
> > > Matt
> > >
> > >
> > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Matt:
> > > > This is the method which made the logging:
> > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName
> > > > serverName,
> > > >       final int znodeVersion)
> > > >   throws KeeperException, IOException {
> > > >     byte [] payload = Writables.getBytes(a, b);
> > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > > >       znodeVersion, payload);
> > > >   }
> > > >
> > > > transitionZKNode() calls tickleNodeSplit() when waiting for master to
> > > split
> > > > the region. Obviously something caused the master not able to split.
> > > >
> > > > How large is the region ?
> > > >
> > > > Can you pastebin master log for that period of time ?
> > > >
> > > > Thanks
> > > >
> > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > >
> > > > > We upgraded from .94.0 to .94.2 last week and have started to
> > encounter
> > > > > infinite loops of region-transition on splits.  I'm not sure yet if
> > > it's
> > > > > all splits nor if it's related to load.  Solution so far has been
> to
> > > > > restart the regionserver process.
> > > > >
> > > > > log snippet:
> > > > > http://pastebin.com/LpienZ7B
> > > > >
> > > > > It's repeating these two lines:
> > > > > 2012-11-02 01:35:33,312 DEBUG
> > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition node
> > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > RS_ZK_REGION_SPLIT
> > > > > 2012-11-02 01:35:33,364 DEBUG
> > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned node
> > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > RS_ZK_REGION_SPLIT
> > > > >
> > > > > with the occasional:
> > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> waiting
> > on
> > > > the
> > > > > master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> > > > >
> > > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?  It
> > > looks
> > > > > wrong, but I'm not familiar with the code at all.
> > > > >
> > > > > Thanks,
> > > > > Matt
> > > > >
> > > >
> > >
> >
>