Posted to dev@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2014/07/16 23:55:06 UTC

[jira] [Resolved] (HBASE-1882) Put bad edits into a failed edits log rather than puke and crash

     [ https://issues.apache.org/jira/browse/HBASE-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-1882.
-----------------------------------

    Resolution: Cannot Reproduce

Haven't seen stuff like this in ages. Reopen or file a new issue if it's still relevant with modern HBase versions.

> Put bad edits into a failed edits log rather than puke and crash
> ----------------------------------------------------------------
>
>                 Key: HBASE-1882
>                 URL: https://issues.apache.org/jira/browse/HBASE-1882
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>
> Bits got flipped (it seems) on a read, so a key going into an hfile was < the previous one.  We throw an exception that causes an HRS restart.  Just put the bad edits aside rather than die.
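> A rough sketch of the proposed behavior (the names here are illustrative, not actual HBase classes): rather than letting one out-of-order key abort the flush and take the regionserver down, catch the failure per edit and set the offender aside in a side file for later inspection.
> {code}
> // Hypothetical sketch of a "failed edits log" -- illustrative names only,
> // not the real HRegion/Store flush code.
> import java.io.IOException;
>
> public class FlushSketch {
>   interface KeyValueSink {
>     // May throw on an out-of-order key, as the hfile writer does.
>     void append(byte[] key, byte[] value) throws IOException;
>   }
>
>   interface FailedEditsLog {
>     // Side file that collects edits we could not flush.
>     void append(byte[] key, byte[] value) throws IOException;
>   }
>
>   static void flush(Iterable<byte[][]> edits, KeyValueSink writer,
>       FailedEditsLog failedEdits) throws IOException {
>     for (byte[][] kv : edits) {
>       try {
>         writer.append(kv[0], kv[1]);
>       } catch (IOException e) {
>         // Today this propagates as a DroppedSnapshotException and forces
>         // a server shutdown; instead, put the bad edit aside and continue.
>         failedEdits.append(kv[0], kv[1]);
>       }
>     }
>   }
> }
> {code}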
> From elsif up on list:
> {code}
> Our HBase system ended up in a looping situation trying to continuously
> re-assign a damaged region across the HBase cluster. We could not properly
> scan or store data in the affected table.
> The triggering event that caused this cascade of errors was a
> java.io.IOException: Added a key not lexically larger than previous
> Using the HBase shell "scan '.META.'" command, we confirmed the name of
> the damaged encoded region stored in hdfs. In an attempt to fix this, the
> data directory for the impacted region was moved off hdfs and the region
> was restarted with a blank slate.
> Is there a better way to handle this type of failure?
> Is there a way to generate an hlog to re-import the data files we moved
> away?
> HBase Version: 0.20.0, r805538
> Hadoop Version: 0.20.0-plus4681, r767961
> Here are the log entries leading up to the event:
> 2009-09-25 06:57:53,836 INFO org.apache.hadoop.hbase.regionserver.HLog:
> Roll /hbase/.logs/hfs-030035,60020,1253122636703/hlog.dat.1253883473798,
> entries=1479, calcsize=49659443, filesize=49589196. New hlog
> /hbase/.logs/hfs-030035,60020,1253122636703/hlog.dat.1253887073833
> 2009-09-25 07:05:50,089 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
> fc_test,\x2Fdata\x2Fmalware\x2F873fc5b61af807827381f120ad7d746984d49426eb3c8b5523b6142285ee0844027f6b45eb8f18aff21b52071a9664e05ceca112a9ff5949c16cda3f27c9c7c2,1253728360248
> because global memstore limit of 396.9m exceeded; currently 397.0m and
> flushing till 248.1m
> 2009-09-25 07:05:53,294 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
> fc_test,\x2Fdata\x2Fdir\x2Fdfd2e9615461d03ba7c22d6d804ec23c3dba1587ffb91736d0e90df0b8aa659d6f55efe49552a83271d9e115bd1d706b865dc4200852493400819628a7be0fc2\x2F2009-09-23_143101\x2Fxml,1253765882755 because
> global memstore limit of 396.9m exceeded; currently 357.1m and flushing
> till 248.1m
> 2009-09-25 07:05:55,583 FATAL
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> fc_test,\x2Fdata\x2Fdir\x2Fdfd2e9615461d03ba7c22d6d804ec23c3dba1587ffb91736d0e90df0b8aa659d6f55efe49552a83271d9e115bd1d706b865dc4200852493400819628a7be0fc2\x2F2009-09-23_143101\x2Fxml,1253765882755
>    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:950)
>    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:843)
>    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
>    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushSomeRegions(MemStoreFlusher.java:352)
>    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:321)
>    at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1809)
>    at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>    at java.lang.reflect.Method.invoke(Unknown Source)
>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> Caused by: java.io.IOException: Added a key not lexically larger than
> previous
> key=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4^D11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206^Ejson:^@^@^A#?M?)^D,
> lastkey=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea^Ejson:^@^@^A#?M?#^D
>    at org.apache.hadoop.hbase.io.hfile.HFile$Writer.checkKey(HFile.java:517)
>    at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:479)
>    at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:447)
>    at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:525)
>    at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:489)
> 2009-09-25 07:05:55,608 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=15, stores=30, storefiles=43, storefileIndexSize=3,
> memstoreSize=396, usedHeap=540, maxHeap=992, blockCacheSize=73256760,
> blockCacheFree=967258312, blockCacheCount=1128, blockCacheHitRatio=90
> 2009-09-25 07:05:55,609 WARN
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush failed
> 2009-09-25 07:05:55,684 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> server on 60020
> Here are the log entries after restarting:
> 2009-09-28 11:36:16,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 0 on 60020: starting
> 2009-09-28 11:36:16,620 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.NotServingRegionException:
> [B@1a001ff
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2263)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1767)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>        at java.lang.reflect.Method.invoke(Unknown Source)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> {code}
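> For context on the exception itself: the HFile writer requires appended keys to arrive in non-decreasing comparator order and throws when a key sorts before the previous one. A minimal sketch of that style of check follows (plain unsigned lexicographic byte comparison for illustration; not the exact HFile.Writer.checkKey code):
> {code}
> // Sketch of the ordering invariant behind "Added a key not lexically
> // larger than previous". Illustrative only.
> import java.io.IOException;
>
> public class KeyOrderCheck {
>   private byte[] lastKey;
>
>   void checkKey(byte[] key) throws IOException {
>     if (lastKey != null && compare(lastKey, key) > 0) {
>       throw new IOException("Added a key not lexically larger than previous"
>           + " key=" + new String(key) + ", lastkey=" + new String(lastKey));
>     }
>     lastKey = key;
>   }
>
>   // Unsigned lexicographic comparison, byte by byte; the shorter array
>   // sorts first on a tie.
>   static int compare(byte[] a, byte[] b) {
>     int n = Math.min(a.length, b.length);
>     for (int i = 0; i < n; i++) {
>       int d = (a[i] & 0xff) - (b[i] & 0xff);
>       if (d != 0) return d;
>     }
>     return a.length - b.length;
>   }
> }
> {code}
> A single flipped bit in a byte read back for comparison is enough to make one key sort before its predecessor and trip this check.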
> ....
> and
> {code}
> stack wrote:
> > On Mon, Sep 28, 2009 at 3:27 PM, elsif <el...@gmail.com> wrote:
> >
> >
> >> Our HBase system ended up in a looping situation trying to continuously
> >> re-assign a damaged region across the HBase cluster. We could not properly
> >> scan or store data in the affected table.
> >>
> >> The triggering event that caused this cascade of errors was a
> >> java.io.IOException: Added a key not lexically larger than previous
> >>
> >>
> >
> >
> > Here are the offending keys purportedly:
> >
> > key=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4^D11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206^Ejson:^@^@^A#?M?)^D,
> > lastkey=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea^Ejson:^@^@^A#?M?#^D
> >
> >
> >
> > It seems like the keys are fine until we get to '^D'.  Can you make out
> > these keys, or comment on them?  The '^D' is a printable rendering of
> > whatever bit of binary was here.  Do you have an idea what it was?  Can
> > you reproduce this condition?  Could something in our comparator be
> > messing up?  Is that possible?
> >
> >
> The keys are all plain text strings with no special characters.  Not
> sure where the '^D' would come from, since the same process is used to
> generate all the keys.
> > This is in .META. table?
> >
> This is from a regular table.
> >
> >> Using the HBase shell "scan '.META.'" command, we confirmed the name of
> >> the damaged encoded region stored in hdfs. In an attempt to fix this, the
> >> data directory for the impacted region was moved off hdfs and the region
> >> was restarted with a blank slate.
> >>
> >> Is there a better way to handle this type of failure?
> >>
> >>
> >>
> >
> > There is a script that will repair the broken files, rewriting them to
> > remove the offending edit.  I'd point you at the script, only it's up in
> > an Apache JIRA and that's sick at the moment.
> >
> > You could try running:
> >
> > ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
> >
> > It has diagnostic and output facilities.  Pass it the bad files.
> >
> >
> >
> I scanned each of the files with the -k option; no warnings were generated.
> I also extracted all the key values from each file - none of them appear
> to contain the key with the '^D'.
> The 'key' and 'lastkey' listed above were contained in the
> oldlogfile.log.  I opened the oldlogfile.log with a hex editor and
> verified that the key does not contain any binary characters where the
> '^D' is shown in the error log.  The character is actually a lowercase 'd':
> /data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206
> It would seem this was a read error of some kind.
> >
> >
> >> Is there a way to generate an hlog to re-import the data files we moved
> >> away?
> >>
> >>
> >>
> >
> > Above mentioned script is probably the better way to go.
> >
> >> HBase Version: 0.20.0, r805538
> >> Hadoop Version: 0.20.0-plus4681, r767961
> >>
> >>
> >>
> > Are these release 0.20.0?
> >
> The Hadoop is release 0.20.0; the HBase is a pre-release svn checkout.
> > St.Ack
> >
> {code}
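> On the re-import question above, one approach would be to extract the key/values from the moved-aside store files and re-submit them through the client API. A hedged sketch, with the extraction step left abstract (SalvagedEdit is a hypothetical stand-in, not an HBase class); HTable and Put are the 0.20-era client classes:
> {code}
> // Sketch: re-import salvaged edits as ordinary client Puts.
> import java.io.IOException;
> import java.util.List;
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
>
> public class ReimportSketch {
>   // Hypothetical record of one salvaged cell; filling it in from the
>   // moved-aside files is the hard part and is not shown here.
>   static class SalvagedEdit {
>     byte[] row, family, qualifier, value;
>   }
>
>   static void reimport(List<SalvagedEdit> edits) throws IOException {
>     HTable table = new HTable(new HBaseConfiguration(), "fc_test");
>     for (SalvagedEdit e : edits) {
>       Put put = new Put(e.row);
>       put.add(e.family, e.qualifier, e.value); // re-submit the cell
>       table.put(put);
>     }
>     table.flushCommits(); // make sure buffered puts reach the servers
>   }
> }
> {code}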



--
This message was sent by Atlassian JIRA
(v6.2#6252)