You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Joseph Pallas <jo...@oracle.com> on 2011/08/19 02:05:04 UTC

In search of a bug around splitting

I’m not exactly at my wit’s end, but I could use some insight here from someone who is intimately familiar with region splitting.  Any guidance (clearing up misunderstandings) would be great.

I’ll start with the main point: I think there may be a bug in HBase 0.90.3 that involves reading from a region while it is being split.  Below, I’ll explain where I think the bug might be and why I am not certain.

Here is the situation.  We have seen our test program turn up a mysterious failure on a small number of occasions.  The first couple of times we thought it must be a bug in our client code, put it into the “fix this later” category, and carried on.  Then we stopped seeing it for a while and figured it must have been fixed by some other change.  But when it next appeared we looked at it more closely, and we could not explain what we were seeing.

The test program has multiple client threads, each of which is performing a stream of operations (it's actually a custom workload running in the YCSB framework).  The program is keeping track of data that was inserted by write operations, and subsequent read operations only retrieve data that was previously written.  The read operation involves first doing a HTableInterface.exists call on a row/cf/qual that is expected to exist.  It is this exists call that we have seen fail.  When the failure occurs, the client reports an exception and stops.  Then we examine the data using the HBase shell, and the item we were looking for is there: the exists call should have succeeded.  Furthermore, the item has a timestamp that shows it really was inserted several minutes previously—it was not inserted right around the time of the failure (which might happen if there were a race condition of some sort in our client).

So, what is interesting is when we look at the log files for the region server, and at the time this happens, the region involved is in the middle of a split. Also, the key we failed on is greater than the split key.  After much reading of the code in SplitTransaction and HRegionServer, I came up with a theory.

When a region splits, daughter regions are created and the region is marked as offline/splitting in META (by MetaEditor.offlineParentInMeta).  The daughter regions are brought online and added to META by SplitTransaction.openDaughterRegion and HRegionServer.postOpenDeployTasks.  Later, the META entry for the original region is cleaned up.  The two daughter regions are managed in their own DaughterOpener thread.  This is where I am suspicious: if daughter A’s thread updates META before daughter B’s thread does, then there's a window of time on the client when HConnectionManager.locateRegionInMeta if looking for a key in daughter B will see only daughter A.  The client, I believe, does not check end rows in META, so it will think that daughter A is the region to handle the request.

Now, the question is: are they any circumstances under which sending that request to the wrong region (daughter A instead of daughter B) might yield incorrect results, instead of an exception?  My gut says maybe, but my experiments have not yet managed to find it.  I know that a path that goes through HRegion.checkRow should throw an exception.  But around the point where RegionScanner is making scanners on all the stores I get lost and uncertain about whether there is a case that can slip through the cracks and not call checkRow.  Maybe involving MemStoreScanner?

So, in summary: I think it is definitely undesirable to have daughter A appear in META before daughter B, but I haven't figured out how that might lead to the error we encountered.  I would be a lot more comfortable filing a JIRA if I could find that faulty case, or if someone could explain why I'm barking up the wrong tree.

Can anyone help?

Thanks.
joe


Re: In search of a bug around splitting

Posted by Joe Pallas <jo...@oracle.com>.
I did a fair bit of munging around to finally reproduce this problem relatively consistently and came up with an explanation.  The result is HBASE-4335 and related issues HBASE-4333 and HBASE-4334.

joe


Re: In search of a bug around splitting

Posted by Stack <st...@duboce.net>.
On Fri, Aug 19, 2011 at 6:56 AM, bijieshan <bi...@huawei.com> wrote:
> One query about the rollback:
> If the journal contains the entry of "PONR", it returns directly.
> The regionserver should abort if rollback returns false. Right?
>

Yes.
St.Ack

Re: In search of a bug around splitting

Posted by bijieshan <bi...@huawei.com>.
One query about the rollback:
If the journal contains the entry of "PONR", it returns directly.
The regionserver should abort if rollback returns false. Right?  

Jieshan
-----------------------------------------------------------------

On Fri, Aug 19, 2011 at 12:05 AM, Joseph Pallas
<jo...@oracle.com> wrote:
> The test program has multiple client threads, each of which is performing a stream of operations (it's actually a custom workload running in the YCSB framework).  The program is keeping track of data that was inserted by write operations, and subsequent read operations only retrieve data that was previously written.  The read operation involves first doing a HTableInterface.exists call on a row/cf/qual that is expected to exist.  It is this exists call that we have seen fail.  When the failure occurs, the client reports an exception and stops.  Then we examine the data using the HBase shell, and the item we were looking for is there: the exists call should have succeeded.  Furthermore, the item has a timestamp that shows it really was inserted several minutes previously-it was not inserted right around the time of the failure (which might happen if there were a race condition of some sort in our client).
>

OK.  The exists call is rarely used I'd say which may be why you are
seeing something we don't.



> So, what is interesting is when we look at the log files for the region server, and at the time this happens, the region involved is in the middle of a split. Also, the key we failed on is greater than the split key.  After much reading of the code in SplitTransaction and HRegionServer, I came up with a theory.
>
> When a region splits, daughter regions are created and the region is marked as offline/splitting in META (by MetaEditor.offlineParentInMeta).  The daughter regions are brought online and added to META by SplitTransaction.openDaughterRegion and HRegionServer.postOpenDeployTasks.  Later, the META entry for the original region is cleaned up.  The two daughter regions are managed in their own DaughterOpener thread.  This is where I am suspicious: if daughter A's thread updates META before daughter B's thread does, then there's a window of time on the client when HConnectionManager.locateRegionInMeta if looking for a key in daughter B will see only daughter A.  The client, I believe, does not check end rows in META, so it will think that daughter A is the region to handle the request.
>

ooooh.

> Now, the question is: are they any circumstances under which sending that request to the wrong region (daughter A instead of daughter B) might yield incorrect results, instead of an exception?  My gut says maybe, but my experiments have not yet managed to find it.


Well, we can do a transaction that involved mutliple rows.  Currently
(as I'm sure you know by now), the steps are:

1. close region (NSRE if anyone asks for the region after close)
2. offline region in edit (still NSRE'ing)
3. Open Daughters in parallel and then in parallel update .META.

We should add daughters, daughter B first, then daughter A, and then
offline parent?  If we do it in this sequence, if you are looking for
a row in daughter A, you'll get the parent still and then a NSRE
because its closed.... so you'll go back to .META. and then find
daughter A eventually.  If you are looking for a row in B and A is
online first, you'll think it has it when it doesn't... which would be
bad.

If we offline parent first and then add daughter B first... and we're
looking for row in daughter A, but its not online yet, we'll get
WrongRegionException which would be a blast from the past... something
we used to get in the old days but like polio, managed to eradicate
them.

How does this sound Joe?  We could rig you a SplitTransaction to do
the above.  We could hack one up first and if it did away with your
issue, we'd then spend a bit of time making sure it rolled back
properly on fail (need to make sure rollback works properly).

St.Ack

Re: In search of a bug around splitting

Posted by Joseph Pallas <jo...@oracle.com>.
On Aug 18, 2011, at 8:22 PM, Stack wrote:

> On Fri, Aug 19, 2011 at 12:05 AM, Joseph Pallas
> <jo...@oracle.com> wrote:
>> The test program has multiple client threads, each of which is performing a stream of operations (it's actually a custom workload running in the YCSB framework).  The program is keeping track of data that was inserted by write operations, and subsequent read operations only retrieve data that was previously written.  The read operation involves first doing a HTableInterface.exists call on a row/cf/qual that is expected to exist.  It is this exists call that we have seen fail.  When the failure occurs, the client reports an exception and stops.  Then we examine the data using the HBase shell, and the item we were looking for is there: the exists call should have succeeded.  Furthermore, the item has a timestamp that shows it really was inserted several minutes previously—it was not inserted right around the time of the failure (which might happen if there were a race condition of some sort in our client).
>> 
> 
> OK.  The exists call is rarely used I'd say which may be why you are
> seeing something we don't.

Yeah, I was concerned about this as well.  It looks like the server-side implementation of exists is really just get and see if the result is empty.  But there could be something more subtle there.

> Well, we can do a transaction that involved mutliple rows.  Currently
> (as I'm sure you know by now), the steps are:
> 
> 1. close region (NSRE if anyone asks for the region after close)
> 2. offline region in edit (still NSRE'ing)
> 3. Open Daughters in parallel and then in parallel update .META.
> 
> We should add daughters, daughter B first, then daughter A, and then
> offline parent?  If we do it in this sequence, if you are looking for
> a row in daughter A, you'll get the parent still and then a NSRE
> because its closed.... so you'll go back to .META. and then find
> daughter A eventually.  If you are looking for a row in B and A is
> online first, you'll think it has it when it doesn't... which would be
> bad.
> 
> If we offline parent first and then add daughter B first... and we're
> looking for row in daughter A, but its not online yet, we'll get
> WrongRegionException which would be a blast from the past... something
> we used to get in the old days but like polio, managed to eradicate
> them.

Is that what would happen?  I thought the client would throw RegionOfflineException if .META. says the region is offline (from HConnectionManager.locateRegionInMeta), and if daughter A is not added to .META. until it is online, then wouldn't locateRegionInMeta choose the offlined parent instead of daughter B?

> How does this sound Joe?  We could rig you a SplitTransaction to do
> the above.  We could hack one up first and if it did away with your
> issue, we'd then spend a bit of time making sure it rolled back
> properly on fail (need to make sure rollback works properly).

The awkward part is that this happens rarely enough that I can't say with confidence how long I would need to test it before I could say that the problem is gone.  That's why I was hoping to get a good theory for what happens and to construct a test that forces it.

joe


Re: In search of a bug around splitting

Posted by Stack <st...@duboce.net>.
I looked at how things worked in 0.20 era.  It did parent offlining
first as per 0.90 and then added the daughters A first and then B.
Below is the code.  So, it probably had same issue I'd guess.
St.Ack

    // Mark old region as offline and split in META.
    // NOTE: there is no need for retry logic here. HTable does it for us.
    oldRegionInfo.setOffline(true);
    oldRegionInfo.setSplit(true);
    // Inform the HRegionServer that the parent HRegion is no-longer online.
    this.server.removeFromOnlineRegions(oldRegionInfo);

    Put put = new Put(oldRegionInfo.getRegionName());
    put.add(CATALOG_FAMILY, REGIONINFO_QUALIFIER,
        Writables.getBytes(oldRegionInfo));
    put.add(CATALOG_FAMILY, SPLITA_QUALIFIER,
        Writables.getBytes(newRegions[0].getRegionInfo()));
    put.add(CATALOG_FAMILY, SPLITB_QUALIFIER,
        Writables.getBytes(newRegions[1].getRegionInfo()));
    t.put(put);

    // If we crash here, then the daughters will not be added and we'll have
    // and offlined parent but no daughters to take up the slack.  hbase-2244
    // adds fixup to the metascanners.

    // Add new regions to META
    for (int i = 0; i < newRegions.length; i++) {
      put = new Put(newRegions[i].getRegionName());
      put.add(CATALOG_FAMILY, REGIONINFO_QUALIFIER, Writables.getBytes(
        newRegions[i].getRegionInfo()));
      t.put(put);
    }



On Fri, Aug 19, 2011 at 3:22 AM, Stack <st...@duboce.net> wrote:
> On Fri, Aug 19, 2011 at 12:05 AM, Joseph Pallas
> <jo...@oracle.com> wrote:
>> The test program has multiple client threads, each of which is performing a stream of operations (it's actually a custom workload running in the YCSB framework).  The program is keeping track of data that was inserted by write operations, and subsequent read operations only retrieve data that was previously written.  The read operation involves first doing a HTableInterface.exists call on a row/cf/qual that is expected to exist.  It is this exists call that we have seen fail.  When the failure occurs, the client reports an exception and stops.  Then we examine the data using the HBase shell, and the item we were looking for is there: the exists call should have succeeded.  Furthermore, the item has a timestamp that shows it really was inserted several minutes previously—it was not inserted right around the time of the failure (which might happen if there were a race condition of some sort in our client).
>>
>
> OK.  The exists call is rarely used I'd say which may be why you are
> seeing something we don't.
>
>
>
>> So, what is interesting is when we look at the log files for the region server, and at the time this happens, the region involved is in the middle of a split. Also, the key we failed on is greater than the split key.  After much reading of the code in SplitTransaction and HRegionServer, I came up with a theory.
>>
>> When a region splits, daughter regions are created and the region is marked as offline/splitting in META (by MetaEditor.offlineParentInMeta).  The daughter regions are brought online and added to META by SplitTransaction.openDaughterRegion and HRegionServer.postOpenDeployTasks.  Later, the META entry for the original region is cleaned up.  The two daughter regions are managed in their own DaughterOpener thread.  This is where I am suspicious: if daughter A’s thread updates META before daughter B’s thread does, then there's a window of time on the client when HConnectionManager.locateRegionInMeta if looking for a key in daughter B will see only daughter A.  The client, I believe, does not check end rows in META, so it will think that daughter A is the region to handle the request.
>>
>
> ooooh.
>
>> Now, the question is: are they any circumstances under which sending that request to the wrong region (daughter A instead of daughter B) might yield incorrect results, instead of an exception?  My gut says maybe, but my experiments have not yet managed to find it.
>
>
> Well, we can do a transaction that involved mutliple rows.  Currently
> (as I'm sure you know by now), the steps are:
>
> 1. close region (NSRE if anyone asks for the region after close)
> 2. offline region in edit (still NSRE'ing)
> 3. Open Daughters in parallel and then in parallel update .META.
>
> We should add daughters, daughter B first, then daughter A, and then
> offline parent?  If we do it in this sequence, if you are looking for
> a row in daughter A, you'll get the parent still and then a NSRE
> because its closed.... so you'll go back to .META. and then find
> daughter A eventually.  If you are looking for a row in B and A is
> online first, you'll think it has it when it doesn't... which would be
> bad.
>
> If we offline parent first and then add daughter B first... and we're
> looking for row in daughter A, but its not online yet, we'll get
> WrongRegionException which would be a blast from the past... something
> we used to get in the old days but like polio, managed to eradicate
> them.
>
> How does this sound Joe?  We could rig you a SplitTransaction to do
> the above.  We could hack one up first and if it did away with your
> issue, we'd then spend a bit of time making sure it rolled back
> properly on fail (need to make sure rollback works properly).
>
> St.Ack
>

Re: In search of a bug around splitting

Posted by Stack <st...@duboce.net>.
On Fri, Aug 19, 2011 at 12:05 AM, Joseph Pallas
<jo...@oracle.com> wrote:
> The test program has multiple client threads, each of which is performing a stream of operations (it's actually a custom workload running in the YCSB framework).  The program is keeping track of data that was inserted by write operations, and subsequent read operations only retrieve data that was previously written.  The read operation involves first doing a HTableInterface.exists call on a row/cf/qual that is expected to exist.  It is this exists call that we have seen fail.  When the failure occurs, the client reports an exception and stops.  Then we examine the data using the HBase shell, and the item we were looking for is there: the exists call should have succeeded.  Furthermore, the item has a timestamp that shows it really was inserted several minutes previously—it was not inserted right around the time of the failure (which might happen if there were a race condition of some sort in our client).
>

OK.  The exists call is rarely used I'd say which may be why you are
seeing something we don't.



> So, what is interesting is when we look at the log files for the region server, and at the time this happens, the region involved is in the middle of a split. Also, the key we failed on is greater than the split key.  After much reading of the code in SplitTransaction and HRegionServer, I came up with a theory.
>
> When a region splits, daughter regions are created and the region is marked as offline/splitting in META (by MetaEditor.offlineParentInMeta).  The daughter regions are brought online and added to META by SplitTransaction.openDaughterRegion and HRegionServer.postOpenDeployTasks.  Later, the META entry for the original region is cleaned up.  The two daughter regions are managed in their own DaughterOpener thread.  This is where I am suspicious: if daughter A’s thread updates META before daughter B’s thread does, then there's a window of time on the client when HConnectionManager.locateRegionInMeta if looking for a key in daughter B will see only daughter A.  The client, I believe, does not check end rows in META, so it will think that daughter A is the region to handle the request.
>

ooooh.

> Now, the question is: are they any circumstances under which sending that request to the wrong region (daughter A instead of daughter B) might yield incorrect results, instead of an exception?  My gut says maybe, but my experiments have not yet managed to find it.


Well, we can do a transaction that involved mutliple rows.  Currently
(as I'm sure you know by now), the steps are:

1. close region (NSRE if anyone asks for the region after close)
2. offline region in edit (still NSRE'ing)
3. Open Daughters in parallel and then in parallel update .META.

We should add daughters, daughter B first, then daughter A, and then
offline parent?  If we do it in this sequence, if you are looking for
a row in daughter A, you'll get the parent still and then a NSRE
because its closed.... so you'll go back to .META. and then find
daughter A eventually.  If you are looking for a row in B and A is
online first, you'll think it has it when it doesn't... which would be
bad.

If we offline parent first and then add daughter B first... and we're
looking for row in daughter A, but its not online yet, we'll get
WrongRegionException which would be a blast from the past... something
we used to get in the old days but like polio, managed to eradicate
them.

How does this sound Joe?  We could rig you a SplitTransaction to do
the above.  We could hack one up first and if it did away with your
issue, we'd then spend a bit of time making sure it rolled back
properly on fail (need to make sure rollback works properly).

St.Ack