Posted to user@hadoop.apache.org by tsuna <ts...@gmail.com> on 2016/01/28 22:26:27 UTC

Re: All RegionServers stuck on BadVersion from ZK after cluster restart

Resending to user@hadoop.apache.org now that I’m subscribed to that list.

On Thu, Jan 28, 2016 at 10:50 AM, tsuna <ts...@gmail.com> wrote:

> Just to close the loop on this ordeal…
>
> I started by clearing /hbase/splitWAL in ZK and restarting all the RS and
> the HM.  This didn’t change anything.
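>
> For reference, clearing that znode amounts to something like the
> following with the ZK shell that ships with HBase (the exact path
> depends on your zookeeper.znode.parent setting, which is the default
> /hbase here):
>
>   ./bin/hbase zkcli            # opens a ZooKeeper shell
>   rmr /hbase/splitWAL          # typed at the zk prompt
>
> followed by a rolling restart of the RegionServers and the HMaster.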
>
> On Wed, Jan 27, 2016 at 8:42 AM, tsuna <ts...@gmail.com> wrote:
> > 16/01/27 16:33:39 INFO namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2], src=/hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> > 16/01/27 16:33:39 WARN BlockStateChange: BLOCK* BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease removed.
>
> I ran hdfs fsck -move to make sure that all the files with lost blocks
> were moved to /lost+found, and this obviously didn’t help HBase,
> because, as I stated earlier, only one WAL had lost a block, and 94% of
> the lost blocks affected the HFile of one of the regions.
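>
> For anyone hitting the same thing, the fsck side of it looks roughly
> like this (paths relative to the Hadoop install directory;
> -list-corruptfileblocks shows which files are affected before -move
> relocates them to /lost+found):
>
>   ./bin/hdfs fsck /hbase -list-corruptfileblocks
>   ./bin/hdfs fsck /hbase -move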
>
> Yet, somehow, the error above appeared for every single one of the region
> servers, and I ended up having to move more WAL files manually to
> /lost+found:
>
> foo@r12s3:~/hadoop-2.7.1$ ./bin/hdfs dfs -ls /lost+found
> Found 15 items
> drwxr--r--   - foo supergroup          0 2016-01-28 05:56 /lost+found/hbase
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:33 /lost+found/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:29 /lost+found/r12s10.sjc.aristanetworks.com%2C9104%2C1452811286704.default.1453728581434
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:31 /lost+found/r12s11.sjc.aristanetworks.com%2C9104%2C1452811286222.default.1453728710303
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s13.sjc.aristanetworks.com%2C9104%2C1452811287287.default.1453728621698
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s14.sjc.aristanetworks.com%2C9104%2C1452811286288.default.1453728336644
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s15.sjc.aristanetworks.com%2C9104%2C1453158959800.default.1453728342559
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s2.sjc.aristanetworks.com%2C9104%2C1452811286448.default.1453728137282
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s3.sjc.aristanetworks.com%2C9104%2C1452811286093.default.1453728393926
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:35 /lost+found/r12s4.sjc.aristanetworks.com%2C9104%2C1452811289547.default.1453728949397
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s5.sjc.aristanetworks.com%2C9104%2C1452811125084.default.1453728624262
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s6.sjc.aristanetworks.com%2C9104%2C1452811286154.default.1453728483550
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s7.sjc.aristanetworks.com%2C9104%2C1452811287528.default.1453728528180
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s8.sjc.aristanetworks.com%2C9104%2C1452811287196.default.1453728125912
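>
> Each of those was moved with something along these lines, one WAL per
> dead server’s -splitting directory (the host, port and start timestamp
> in the path vary per RegionServer):
>
>   ./bin/hdfs dfs -mv \
>       '/hbase/WALs/<host>,<port>,<startcode>-splitting/<wal file>' \
>       /lost+found/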
>
> After doing this and restarting the HMaster, everything came back up
> fine.  I don’t know whether doing this caused any additional data loss.
> This is a dev cluster, so data loss isn’t a big deal, but if I were to
> run into this issue in production, I would certainly be very nervous
> about this whole situation.
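>
> In that case, the first thing I’d probably do after the restart is
> something like
>
>   ./bin/hbase hbck
>
> to check for holes or inconsistencies in the region chain before
> trusting the tables again.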
>
> This might turn more into an HDFS question at this point, so I’m Cc’ing
> hdfs-user@ just in case anybody has anything to say there.
>
> We’re going to upgrade to Hadoop 2.7.2 soon, just in case.
>

-- 
Benoit "tsuna" Sigoure