Posted to dev@hbase.apache.org by 周帅锋 <zh...@gmail.com> on 2014/12/04 08:00:06 UTC

split failed caused by FileNotFoundException

In our HBase clusters, a split sometimes fails because the file to be split
does not exist in the parent region. In 0.94.2, this causes the regionserver
to shut down, because the split transaction has already reached the PONR
(point of no return) state. In 0.94.20 and 0.98, the split fails but can be
rolled back, because the split transaction has only reached the
offlined_parent state.
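
The difference between the two outcomes depends on where the failure lands
relative to the point of no return. A minimal sketch of that idea in plain
Java follows. It is not the HBase SplitTransaction code: the class name, the
simplified phase list, and the outcome labels are assumptions made for
illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a journaled transaction with a point of no return. */
final class PonrSketch {
    /** Hypothetical, simplified phases; the real split journal differs. */
    enum Phase { CREATE_SPLIT_DIR, CLOSE_PARENT, OFFLINE_PARENT, PONR, OPEN_DAUGHTERS }

    enum Outcome { COMPLETED, ROLLED_BACK, SERVER_ABORT }

    /** Runs the phases in order; failAt (may be null) is the phase that throws. */
    static Outcome run(Phase failAt) {
        List<Phase> journal = new ArrayList<>();
        try {
            for (Phase p : Phase.values()) {
                journal.add(p); // record the step before doing its work
                if (p == failAt) {
                    throw new IllegalStateException("failure in " + p);
                }
            }
            return Outcome.COMPLETED;
        } catch (IllegalStateException e) {
            // Once PONR has been journaled, undoing is no longer safe.
            if (journal.contains(Phase.PONR)) {
                return Outcome.SERVER_ABORT;
            }
            // Before PONR: undo the journaled steps in reverse order.
            for (int i = journal.size() - 1; i >= 0; i--) {
                journal.remove(i);
            }
            return Outcome.ROLLED_BACK;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Phase.OFFLINE_PARENT));
        System.out.println(run(Phase.OPEN_DAUGHTERS));
    }
}
```

In this model, a missing file that is only noticed while opening the
daughters falls on the abort side, while the same error noticed while the
journal has only reached the offlined-parent step can still be undone.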

In 0.94.2, the error looks like this:
2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
Offlined parent region xxxxx in META
2014-09-23 22:27:55,820 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
of failed split of xxxxx
Caused by: java.io.IOException: java.io.IOException:
java.io.FileNotFoundException: File does not exist: xxxxx
Caused by: java.io.IOException: java.io.FileNotFoundException: File does
not exist: xxxxx
Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
2014-09-23 22:27:55,823 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
xxx,60020,1411383568857: Abort; we got an error after point-of-no-return

The reason for the missing files is a little complex. The whole procedure
involves two failed splits and one compaction:
1) There are too many files in the region and a compaction is requested, but
it is not executed because there are many CompactionRequests
(CompactionRunners) in the compaction queue. The CompactionRequest holds a
reference to the Store object, and also holds the list of that store's
storefiles to compact.

2) The region grows big enough and a split is requested. The region is
offline during the split, and the store is closed. But the split fails while
splitting the files (for some reason such as busy I/O causing a timeout):
2014-09-23 18:26:02,738 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
of failed split of xxxxx; Took too long to split the files and create the
references, aborting split

3) The split is rolled back successfully, and the region is online again.
During the rollback, a new Store object is created, but the old Store held by
the request in the compaction queue is not removed, so there are two (or
maybe more) Store objects for the same store in the regionserver.

4) The compaction requested earlier now runs on the old Store object. Some
storefiles are compacted and removed, and new, bigger storefiles are created.
But the Store re-created during the split rollback does not know about these
changes to the storefiles.

5) The split runs on the region again and fails again, because the
storefiles in the parent region no longer exist (they were removed by the
compaction). The split transaction also does not know that a new file has
been created by the compaction. In 0.94.2, this error is not detected until
the daughter regions open, but by then it is too late: the split has failed
after PONR, and this causes the regionserver to shut down. In 0.94.20 and
0.98, splitStoreFiles reads the storefile in the parent region and detects
the error before PONR, so the failed split can be rolled back.
     Code in HRegionFileSystem.splitStoreFile:
     ...
     byte[] lastKey = f.createReader().getLastKey();
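
Steps 1 to 5 can be replayed in a small stand-alone model. This is only a
sketch: FakeFs, FakeStore, and the file names are hypothetical stand-ins and
not HBase classes, and the model ignores everything except the stale
file list.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative model of the race; none of these types are HBase classes. */
final class StaleStoreRace {
    /** Stand-in for the region directory on the file system. */
    static final class FakeFs {
        final Set<String> files = new LinkedHashSet<>();
    }

    /** Stand-in for a Store: caches its own view of the store files. */
    static final class FakeStore {
        final FakeFs fs;
        final List<String> storeFiles;

        FakeStore(FakeFs fs) {
            this.fs = fs;
            this.storeFiles = new ArrayList<>(fs.files); // snapshot at open time
        }

        /** Merges the inputs into one new file and deletes the inputs. */
        void compact(List<String> inputs, String output) {
            fs.files.removeAll(inputs);
            fs.files.add(output);
            storeFiles.removeAll(inputs);
            storeFiles.add(output);
        }

        /** A split needs every cached store file to still exist. */
        void splitStoreFiles() throws FileNotFoundException {
            for (String f : storeFiles) {
                if (!fs.files.contains(f)) {
                    throw new FileNotFoundException("File does not exist: " + f);
                }
            }
        }
    }

    /** Replays steps 1 to 5 and returns the message of the final failure. */
    static String replay() {
        FakeFs fs = new FakeFs();
        fs.files.add("hfile-a");
        fs.files.add("hfile-b");
        Map<String, FakeStore> stores = new HashMap<>();
        stores.put("cf", new FakeStore(fs));

        // 1) A compaction is queued; it captures the Store and its file list.
        FakeStore queuedStore = stores.get("cf");
        List<String> queuedFiles = new ArrayList<>(queuedStore.storeFiles);

        // 2) and 3) The first split fails and rolls back: a new Store is created.
        stores.put("cf", new FakeStore(fs));

        // 4) The queued compaction runs against the old Store object.
        queuedStore.compact(queuedFiles, "hfile-ab");

        // 5) The second split uses the new Store, whose file list is now stale.
        try {
            stores.get("cf").splitStoreFiles();
            return "split ok";
        } catch (FileNotFoundException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

The new FakeStore never learns about hfile-ab, which matches the observation
in step 5 that the split transaction does not see the compacted file.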

So this situation is a fatal error in early 0.94 versions, and still a
common bug in later 0.94 and higher versions. It is also the reason why the
storefile reader is sometimes null (it was closed by the first failed
split).

RE: split failed caused by FileNotFoundException

Posted by Bijieshan <bi...@huawei.com>.
Nice find, zhoushuaifeng:)

I suggest raising an issue for 0.94.

Jieshan.
________________________________________
From: 周帅锋 [zhoushuaifeng@gmail.com]
Sent: Thursday, December 04, 2014 6:01 PM
To: dev
Subject: Re: split failed caused by FileNotFoundException

I rechecked the code in 0.98. There, this problem is solved by checking the
store object in the compaction runner and cancelling the compaction.
HRegion.compact:

      byte[] cf = Bytes.toBytes(store.getColumnFamilyName());
      if (stores.get(cf) != store) {
        LOG.warn("Store " + store.getColumnFamilyName() + " on region " + this
            + " has been re-instantiated, cancel this compaction request. "
            + " It may be caused by the roll back of split transaction");
        return false;
      }


But would it be better to replace the stale store object with the new one
and continue the compaction on that store, instead of cancelling?


Re: split failed caused by FileNotFoundException

Posted by 周帅锋 <zh...@gmail.com>.
I rechecked the code in 0.98. There, this problem is solved by checking the
store object in the compaction runner and cancelling the compaction.
HRegion.compact:

      byte[] cf = Bytes.toBytes(store.getColumnFamilyName());
      if (stores.get(cf) != store) {
        LOG.warn("Store " + store.getColumnFamilyName() + " on region " + this
            + " has been re-instantiated, cancel this compaction request. "
            + " It may be caused by the roll back of split transaction");
        return false;
      }


But would it be better to replace the stale store object with the new one
and continue the compaction on that store, instead of cancelling?
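
To make the two options concrete, here is a rough stand-alone sketch. It is
not HBase code: FakeStore, compactOrCancel, and compactOnCurrent are
hypothetical names, and the second method only shows the shape of the
alternative raised in the question, not something the 0.98 code does.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only; FakeStore and both method names are hypothetical. */
final class CompactionGuardSketch {
    static final class FakeStore {
        final String family;
        int compactions = 0;

        FakeStore(String family) {
            this.family = family;
        }
    }

    /** Current stores of the region, keyed by column family name. */
    final Map<String, FakeStore> stores = new HashMap<>();

    /** Mirrors the quoted check: refuse to compact a store that was replaced. */
    boolean compactOrCancel(FakeStore queued) {
        if (stores.get(queued.family) != queued) { // identity check, not equals()
            return false;
        }
        queued.compactions++;
        return true;
    }

    /** The alternative from the question: switch to the live store and go on. */
    boolean compactOnCurrent(FakeStore queued) {
        FakeStore current = stores.get(queued.family);
        if (current == null) {
            return false; // family no longer present, nothing to compact
        }
        // A real version would presumably have to re-select the files from
        // `current`, because the queued file list was built from the old object.
        current.compactions++;
        return true;
    }

    public static void main(String[] args) {
        CompactionGuardSketch region = new CompactionGuardSketch();
        FakeStore oldStore = new FakeStore("cf");
        region.stores.put("cf", oldStore);
        region.stores.put("cf", new FakeStore("cf")); // split rollback re-creates it
        System.out.println(region.compactOrCancel(oldStore));
        System.out.println(region.compactOnCurrent(oldStore));
    }
}
```

Cancelling is the simpler option because nothing captured by the old request
is reused. Continuing on the new store avoids losing the compaction, but only
if the file selection is redone against the live store.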

