Posted to user@hbase.apache.org by Nanheng Wu <na...@gmail.com> on 2011/02/16 23:21:33 UTC

Disabling a table taking very long time

From time to time I run into issues where disabling a table pretty
much hangs. I am simply calling the disableTable method of HBaseAdmin.
The table has ~500 regions with the default region file size. I couldn't
tell anything abnormal from the master's log. When I click on a
region in the Master's web UI I get a "RegionOfflineException". I am
using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes, with one
node running the Master/ZK/NameNode. Can someone help me understand what's
wrong? Thanks!
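
In case it helps, this is roughly all the client side does (a minimal
sketch of our 0.20-style code; the table name here is made up):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class DisableMyTable {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
      // Blocks until the master reports every region of the table offline;
      // this is the call that hangs.
      admin.disableTable("myTable");
    }
  }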

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
Sorry J-D, I tried one more thing. I grepped for "compaction" in the
.META. RS log. I saw a lot of them, and each seems to take quite a while
too (one took 23 secs). Would this be something worth looking into?

11/03/01 16:44:30 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 9sec
11/03/01 16:46:18 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:46:18 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982189854
11/03/01 16:46:27 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/5138927692276532859;
store size is 11.1m
11/03/01 16:46:27 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 9sec
11/03/01 16:47:08 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:08 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982189946
11/03/01 16:47:18 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/5873146955360312580;
store size is 11.1m
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 10sec
11/03/01 16:47:18 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 0sec
11/03/01 16:47:19 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:20 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190038
11/03/01 16:47:43 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2025458053952320907;
store size is 11.2m
11/03/01 16:47:43 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 23sec
11/03/01 16:47:43 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:43 INFO regionserver.Store: Started compaction of 4
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190222
11/03/01 16:47:46 INFO regionserver.Store: Completed compaction of 4
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/7645102490458560555;
store size is 11.2m
11/03/01 16:47:46 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 3sec
11/03/01 16:47:47 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:47 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190267
11/03/01 16:47:57 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2555741109593674333;
store size is 11.2m
11/03/01 16:47:58 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 10sec
11/03/01 16:47:58 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:58 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190359
11/03/01 16:48:08 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2032875459950620676;
store size is 11.2m
11/03/01 16:48:09 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 10sec
11/03/01 16:48:09 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:48:09 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190451
11/03/01 16:48:22 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/1870868686430006280;
store size is 11.3m
11/03/01 16:48:22 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 13sec
11/03/01 16:48:22 INFO regionserver.HRegion: Starting compaction on
region .META.,,1



On Tue, Mar 1, 2011 at 11:01 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> That is correct. Also, disabling a table requires scanning and
> modifying the .META. region so if it happens to be slow then that
> operation will be slow too.
>
> J-D
>
> On Tue, Mar 1, 2011 at 10:55 AM, Nanheng Wu <na...@gmail.com> wrote:
>> I see, so a Get is only affected for regions not in the client's cache.
>> I think that explains why that error seems hard to reproduce. This is
>> how I understand a Get works: if the client doesn't have the region in
>> its cache it will read the .META. table; if it does, it makes an RPC to
>> the RS directly. Also, each RS knows which regions it
>> actually hosts, so if the region is not there it means the client's
>> cache is stale and the client has to read .META. again. Is this
>> correct?
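>>
>> For concreteness, this is how I picture it from the client side (just an
>> illustrative fragment using the 0.20 client API; "myTable" and "row1" are
>> made up):
>>
>>   HBaseConfiguration conf = new HBaseConfiguration();
>>   HTable table = new HTable(conf, "myTable");
>>   Get get = new Get(Bytes.toBytes("row1"));
>>   // First Get for a row whose region isn't cached yet: the client reads
>>   // .META. to find the hosting RS, caches that location, then sends the
>>   // Get RPC to that RS.
>>   Result first = table.get(get);
>>   // Later Gets for rows in a cached region go straight to the RS; only if
>>   // the RS no longer hosts the region (stale cache) does the client drop
>>   // the entry and go back to .META.
>>   Result second = table.get(get);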
>>
>> On Tue, Mar 1, 2011 at 10:48 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> Depends if you have to query .META. first to get the location of the
>>> region that hosts the row.
>>>
>>> J-D
>>>
>>> On Tue, Mar 1, 2011 at 10:45 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> I really appreciate all the help you've provided so far. I guess
>>>> I'll keep digging. Would this meta scan cause Gets or Scans on user
>>>> tables to be very slow too? That was actually the problem I
>>>> first noticed: from time to time a client Get or Scan would be
>>>> extremely slow on my tables.
>>>>
>>>> On Tue, Mar 1, 2011 at 10:38 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>> The store file looks fairly new, maybe it was just major compacted?
>>>>> You could easily figure that out from the region server's log.
>>>>>
>>>>> Your problem is definitely .META. region slowness. If scanning 8k
>>>>> rows takes 30 minutes, scanning and updating 260 rows at the same time
>>>>> probably isn't much faster.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Tue, Mar 1, 2011 at 10:34 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> It looks like I don't have that many files:
>>>>>>
>>>>>> drwxr-xr-x   - root supergroup          0 2011-02-28 23:21
>>>>>> /hbase/.META./1028785192
>>>>>> -rw-r--r--   3 root supergroup        932 2011-02-03 19:23
>>>>>> /hbase/.META./1028785192/.regioninfo
>>>>>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>>>>>> /hbase/.META./1028785192/info
>>>>>> -rw-r--r--   3 root supergroup   11831330 2011-03-01 17:13
>>>>>> /hbase/.META./1028785192/info/6687606720393313750
>>>>>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>>>>>> /hbase/.META./compaction.dir
>>>>>>
>>>>>> So you think compaction on the .META. table on that RS could be the
>>>>>> problem? While the metaScanner is running I see virtually no logs
>>>>>> on that server though; I did see compactions once the meta scan ended
>>>>>> and region closes started to happen. I wonder what the cluster is doing
>>>>>> while the meta scan is going.
>>>>>>
>>>>>> I would definitely upgrade in the near future; it's just that I am
>>>>>> afraid upgrading won't guarantee a fix if I don't understand the
>>>>>> nature of the problem more clearly. We also have another test cluster
>>>>>> running the same versions of everything on cheaper hardware, and it
>>>>>> doesn't have this problem.
>>>>>>
>>>>>> On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>> That long .META. scan might be the culprit, good catch!
>>>>>>>
>>>>>>> If you run this from the hadoop folder:
>>>>>>>
>>>>>>> bin/hadoop dfs -lsr /hbase/.META.
>>>>>>>
>>>>>>> Do you see a ton of files? Let's define "ton" as more than 16 files.
>>>>>>> If so, and I see you have a very high number of regions, then you
>>>>>>> could be hitting this problem:
>>>>>>> https://issues.apache.org/jira/browse/HBASE-3499
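>>>>>>>
>>>>>>> (Or, if it's handier to check from code, here's a rough sketch that
>>>>>>> just counts the files under .META.; it assumes the Hadoop config is
>>>>>>> on the classpath:)
>>>>>>>
>>>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>>>   import org.apache.hadoop.fs.FileStatus;
>>>>>>>   import org.apache.hadoop.fs.FileSystem;
>>>>>>>   import org.apache.hadoop.fs.Path;
>>>>>>>
>>>>>>>   public class CountMetaFiles {
>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>       FileSystem fs = FileSystem.get(new Configuration());
>>>>>>>       System.out.println(count(fs, new Path("/hbase/.META.")) +
>>>>>>>           " files under /hbase/.META.");
>>>>>>>     }
>>>>>>>     // Recursively counts plain files (store files, .regioninfo, etc.).
>>>>>>>     static int count(FileSystem fs, Path dir) throws Exception {
>>>>>>>       int n = 0;
>>>>>>>       for (FileStatus s : fs.listStatus(dir)) {
>>>>>>>         n += s.isDir() ? count(fs, s.getPath()) : 1;
>>>>>>>       }
>>>>>>>       return n;
>>>>>>>     }
>>>>>>>   }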
>>>>>>>
>>>>>>> In short, every few .META. changes result in a flush, and it could
>>>>>>> be that the region server hosting it isn't able to keep up with the
>>>>>>> compactions. It's done that way because append isn't supported in your
>>>>>>> Hadoop, so that you won't lose too much data if the node dies, which is
>>>>>>> really bad when it happens to .META.
>>>>>>>
>>>>>>> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
>>>>>>> back to the default value if you're not writing, and hence not splitting.
>>>>>>> Running the script attached in that jira would do it (don't forget to
>>>>>>> restart HBase after running it).
>>>>>>>
>>>>>>> Lastly, upgrading to HBase 0.90.1 and a Hadoop that supports append
>>>>>>> should be a priority.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>> Hi J-D:
>>>>>>>>
>>>>>>>> I did the scan like you suggested but no splits came up. This kind
>>>>>>>> of makes sense to me, since we write HFiles using an MR job and then
>>>>>>>> pretty much use HBase read-only, so maybe that's why the regions
>>>>>>>> never got split? I think my slow disable problem is related to this
>>>>>>>> RegionManager.metaScanner. I grepped for that in the Master's log and I
>>>>>>>> saw:
>>>>>>>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>>>>> .META.,,1, startKey: <>}
>>>>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>>>>> .META.,,1, startKey: <>}
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>>>>> .META.,,1, startKey: <>}
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
>>>>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>>>>
>>>>>>>> It looks like the meta scanner is always running, each time taking 20-30
>>>>>>>> minutes. Here's what I did:
>>>>>>>>
>>>>>>>> 1) disable table -> at 16:28:31; according to the log above there's a
>>>>>>>> meta scan in progress.
>>>>>>>>
>>>>>>>> 2) the command hangs, with no output in the .META. RS's log.
>>>>>>>> At 11/03/01 16:46:57 the client got:
>>>>>>>> 11/03/01 16:46:58 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>>>>>>> /hbase/root-region-server got 10.146.14.38:60020
>>>>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
>>>>>>>> ROOT at 10.146.14.38:60020
>>>>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
>>>>>>>> location for .META.,,1 is 10.146.14.38:60020
>>>>>>>> This corresponds exactly to when the meta scan completed. The .META. RS
>>>>>>>> started to spew some logs about closing regions.
>>>>>>>>
>>>>>>>> 3) Another meta scan started and the client hung again until it
>>>>>>>> finished, then the disable command returned. The whole thing took 2688.5930
>>>>>>>> seconds.
>>>>>>>>
>>>>>>>> What kind of problem does it look like to you J-D? What does this
>>>>>>>> background task metaScanner do?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>>> I didn't get most of the information I was looking for... but I do
>>>>>>>>> have something to work with. So I asked whether the metaScanner was
>>>>>>>>> making progress or not, and from that log I think I see it does:
>>>>>>>>>
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>>>>>>> Rowscanned=261, rowsOffline=3
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>>>>>>> regions to be disabled from myTable
>>>>>>>>>
>>>>>>>>> Only 3 rows are offline, whereas we expect all the regions to be
>>>>>>>>> offline. I expect you see those lines a lot? If so, one thing I'd like
>>>>>>>>> to see is the result of this command:
>>>>>>>>>
>>>>>>>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>>>>>>>
>>>>>>>>> It's going to be big. Then grep in the result for the string SPLIT,
>>>>>>>>> and please post back here the lines that match.
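>>>>>>>>>
>>>>>>>>> (The same check from Java if the shell output gets unwieldy; a rough,
>>>>>>>>> untested fragment, and I'm assuming split parents show up in the
>>>>>>>>> info:splitA / info:splitB catalog columns:)
>>>>>>>>>
>>>>>>>>>   HTable meta = new HTable(new HBaseConfiguration(), ".META.");
>>>>>>>>>   Scan scan = new Scan(Bytes.toBytes("myTable,,"));
>>>>>>>>>   scan.addFamily(Bytes.toBytes("info"));
>>>>>>>>>   ResultScanner scanner = meta.getScanner(scan);
>>>>>>>>>   for (Result r : scanner) {
>>>>>>>>>     if (!Bytes.toString(r.getRow()).startsWith("myTable,")) break;
>>>>>>>>>     // A split parent carries references to its daughter regions.
>>>>>>>>>     if (r.getValue(Bytes.toBytes("info"), Bytes.toBytes("splitA")) != null
>>>>>>>>>         || r.getValue(Bytes.toBytes("info"), Bytes.toBytes("splitB")) != null) {
>>>>>>>>>       System.out.println("split parent: " + Bytes.toString(r.getRow()));
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>   scanner.close();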
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>> I issued two more disable commands and looked at the .META. RS. The
>>>>>>>>>> last log entry looks like:
>>>>>>>>>>
>>>>>>>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>>>>>>>>>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>>>>>>>>>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>>>>>>>>>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>>>>>>>
>>>>>>>>>> I issued the command at 00:34:59; the client debug output had
>>>>>>>>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>>>>>>>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>>>>>>>>> Then all of a sudden it output the following:
>>>>>>>>>>
>>>>>>>>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>>>>>>>>>> row <> in tableName .META.: location server 10.146.14.38:60020,
>>>>>>>>>> location region name .META.,,1
>>>>>>>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>>>>>>>>> /hbase/root-region-server got 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>>>>>>>>>> ROOT at 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>>>>>>>>>> location for .META.,,1 is 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>>>>>>> Rowscanned=261, rowsOffline=3
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>>>>>>> regions to be disabled from myTable
>>>>>>>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>>>>>>>>>> regions to be disabled from myTable
>>>>>>>>>>
>>>>>>>>>> Here's the full thread dump of the .META. RS from 00:34:59 to 00:47:31:
>>>>>>>>>> http://pastebin.com/Ywwx8afQ
>>>>>>>>>>
>>>>>>>>>> I hope I compiled some useful information here. I am pretty lost at
>>>>>>>>>> this point, I would really appreciate your help!
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>>>>> Ok so no debug output from the disable command? BTW you should send your
>>>>>>>>>>> GC logging somewhere else, like the .out file or a completely separate file.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>>> I see, so I should jstack the RS hosting the .META. region. I'll do that.
>>>>>>>>>>>>
>>>>>>>>>>>> The master log pretty much looks like this: should I grep for
>>>>>>>>>>>> something specific?
>>>>>>>>>>>>
>>>>>>>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>>>>>>>>>> scan of 1 row(s) of meta region complete
>>>>>>>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>>>>>>>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>>>>>>>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>>>>> real=0.00 secs]
>>>>>>>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>>>>>>>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>>>>>>>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>>>>> real=0.00 secs]
>>>>>>>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>>>>>>>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>>>>>>>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>>>>>>>>>> real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>>>>>>>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>>>>>>>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>>>>> real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>>>>>>>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>>>>>>>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>>>>>>>>>> real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>>>>>>>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>>>>>>>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>>>>> real=0.01 secs]
>>>>>>>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>>>>>>>>>> average load 760.2222222222222
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>>>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>>>>>>>>>> making multiple calls to the .META. region?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the former, then jstack the region server that hosts .META. and see
>>>>>>>>>>>>> where it's blocked.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the latter, then it means your .META. region is slow. Again, what's
>>>>>>>>>>>>> going on on the RS that hosts .META.?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Finally, what's the master's log like during that time?
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks so much for your help so far! I sent disable commands for 4
>>>>>>>>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>>>>>>>>> took a jstack of the master. From what I can tell, all disableTable
>>>>>>>>>>>>>> calls are blocked by the meta scanner thread (sample dump below). At the
>>>>>>>>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>>>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>>>>>>>>> several datasets, each of them in a separate table. Our data
>>>>>>>>>>>>>> pipeline produces a new version of each dataset every day, and only the
>>>>>>>>>>>>>> latest version should be used. This is how the data is loaded: for
>>>>>>>>>>>>>> each dataset 1) run an MR job that outputs HFiles, 2) call loadTable.rb
>>>>>>>>>>>>>> to create a new table, 3) disable and drop the previous version. As a
>>>>>>>>>>>>>> result some calls to load table and drop table can overlap. Please
>>>>>>>>>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>>>>>>>  State: BLOCKED
>>>>>>>>>>>>>>  Blocked count: 3801
>>>>>>>>>>>>>>  Waited count: 72719
>>>>>>>>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>>>>>>>  Stack:
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>>>>>>>  State: WAITING
>>>>>>>>>>>>>>  Blocked count: 1526058
>>>>>>>>>>>>>>  Waited count: 1488998
>>>>>>>>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>>>>>>>  Stack:
>>>>>>>>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>>>>>>>>> <jd...@apache.org> wrote:
>>>>>>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would approach this problem like I generally do with HBase issues:
>>>>>>>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>>>>>>>> this case, grep for the table name).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I would look at the region server log(s) of the nodes that were
>>>>>>>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>>>>>>>> closed).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>>>>>>>> disable, try to jstack the process that seems to be hanging.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>>>>>>>> almost prevents disabling (or re-enabling) a table if any region
>>>>>>>>>>>>>>> recently split and the parent wasn't yet cleaned from .META.; that
>>>>>>>>>>>>>>> is fixed in 0.90.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>>>>>>>> disabling function: I ran a disable command on a very small table
>>>>>>>>>>>>>>>> (probably a dozen MBs in size) with no clients using the cluster at
>>>>>>>>>>>>>>>> all, and it took about 1 hour to complete! The weird thing is that on the
>>>>>>>>>>>>>>>> web UI only the region server carrying the .META. table has non-zero
>>>>>>>>>>>>>>>> requests; all other RSs have 0 requests the entire time. I would think
>>>>>>>>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>>>>>>>>> using the same RS nodes for a map reduce job at the same time, and top
>>>>>>>>>>>>>>>> shows that memory usage is almost full on the node hosting .META. Would you
>>>>>>>>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>>>>>>>>> Thanks so much.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>>>>>>> <jd...@apache.org> wrote:
>>>>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>>>>>> this case, grep for the table name).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>>>>>> closed).
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>>>>>>>> is fixed in 0.90.1
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>>>>>>> Thanks so much.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Depends if you have to query .META. first to get the location of the
region that hosts the row.
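
A quick way to see which case you are in, sketched with the 0.20-era
client API (the table and row names below are made up): the first Get
for a row has to walk -ROOT- and .META. to find the hosting region, and
the client then caches that location, so only the cache-miss path is
exposed to a slow .META.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CheckMetaLookup {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "myTable");   // hypothetical table name
      Get get = new Get(Bytes.toBytes("someRow"));  // hypothetical row key

      long t0 = System.currentTimeMillis();
      table.get(get);  // cold: -ROOT- -> .META. lookup, then the user region
      long t1 = System.currentTimeMillis();
      table.get(get);  // warm: the region location is cached client-side
      long t2 = System.currentTimeMillis();

      System.out.println("cold get: " + (t1 - t0) + " ms, "
          + "warm get: " + (t2 - t1) + " ms");
    }
  }

If the first call is much slower than the repeat, the extra time most
likely went into the location lookup (that is, .META.) rather than the
data region itself.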

J-D

On Tue, Mar 1, 2011 at 10:45 AM, Nanheng Wu <na...@gmail.com> wrote:
> Man, I really appreciate all the help you've provided so far. I guess
> I'll keep digging. Would this meta scan cause a Get or Scan on user
> tables to be very, very slow too? That was actually the problem I
> first noticed: from time to time a client Get or Scan would be
> extremely slow on my tables.
>
> On Tue, Mar 1, 2011 at 10:38 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> The store file looks fairly new, maybe it was just major compacted?
>> You could easily figure that out from the region server's log.
>>
>> Your problem is definitely a .META. region slowness. If scanning 8k
>> rows takes 30 minutes, scanning and updating 260 rows at the same time
>> probably isn't much faster.
>>
>> J-D
>>
>> On Tue, Mar 1, 2011 at 10:34 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> J-D,
>>>
>>> It looks like I don't have that many files:
>>>
>>> drwxr-xr-x   - root supergroup          0 2011-02-28 23:21
>>> /hbase/.META./1028785192
>>> -rw-r--r--   3 root supergroup        932 2011-02-03 19:23
>>> /hbase/.META./1028785192/.regioninfo
>>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>>> /hbase/.META./1028785192/info
>>> -rw-r--r--   3 root supergroup   11831330 2011-03-01 17:13
>>> /hbase/.META./1028785192/info/6687606720393313750
>>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>>> /hbase/.META./compaction.dir
>>>
>>> So you think compaction on the .META. table on the RS could be the
>>> problem? While the metaScanner is running I saw virtually no logs on
>>> that server, though I did see compactions once the meta scan ended
>>> and region closes started to happen. I wonder what the cluster is
>>> doing while the meta scan is in progress.
>>>
>>> I would definitely upgrade in the near future, it's just that I am
>>> afraid upgrading won't guarantee a fix if I don't understand the
>>> nature of the problem more clearly. We also have another test cluster,
>>> running the same versions of everything on cheaper hardware and it
>>> doesn't have this problem.
>>>
>>> On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> That long .META. scan might be the culprit, good catch!
>>>>
>>>> If you run this from the hadoop folder:
>>>>
>>>> bin/hadoop dfs -lsr /hbase/.META.
>>>>
>>>> Do you see a ton of files? Let's define "ton" as more than 16 files.
>>>> If so, and I see you have a very high number of regions, then you
>>>> could be hitting this problem:
>>>> https://issues.apache.org/jira/browse/HBASE-3499
>>>>
>>>> In short, every few .META. change will result in a flush and it could
>>>> be that the region server hosting it isn't able to keep up with the
>>>> compactions. It's done so that since append isn't supported in your
>>>> hadoop that you won't lose too much data if the node dies, which is
>>>> really bad when it happens to .META.
>>>>
>>>> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
>>>> back to the default value if you're not writing hence not splitting.
>>>> Running the script attached in that jira would do it (don't forget to
>>>> restart HBase after running it).
>>>>
>>>> Lastly, upgrading to HBase 0.90.1 and a hadoop that supports append
>>>> should be a priority.
>>>>
>>>> J-D
>>>>
>>>> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> Hi J-D:
>>>>>
>>>>>  I did the scan like you suggested but no splits came up. This kind
>>>>> of makes sense to me, since we write Hfiles using a MR job and then
>>>>> pretty much use HBase for read-only so maybe that's why the regions
>>>>> never got split? I think my slow disable problem is related to this
>>>>> RegionManager.metaScanner. I greped for that in the Masters log and I
>>>>> saw:
>>>>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>> .META.,,1, startKey: <>}
>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>> .META.,,1, startKey: <>}
>>>>> Thread 27 (RegionManager.metaScanner):
>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>> .META.,,1, startKey: <>}
>>>>> Thread 27 (RegionManager.metaScanner):
>>>>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
>>>>> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>
>>>>> It looks like meta scanner is always running, each time taking 20-30
>>>>> minutes. Here's what I did:
>>>>>
>>>>> 1) disable table -> at 16:28:31, according the log above there's a
>>>>> meta scan in progress.
>>>>>
>>>>> 2 ) the command hangs, no output in .META RS's log
>>>>> at 11/03/01 16:46:57 the client got 11/03/01 16:46:58 DEBUG
>>>>> zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got
>>>>> 10.146.14.38:60020
>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
>>>>> ROOT at 10.146.14.38:60020
>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
>>>>> location for .META.,,1 is 10.146.14.38:60020
>>>>> This corresponds exact to when the meta scan completed. .META RS
>>>>> started to spew some logs about closing regions.
>>>>>
>>>>> 3) Another meta scan started and the client hangs again until it
>>>>> finishes, and disable command returns.The whole thing took 2688.5930
>>>>> seconds
>>>>>
>>>>> What kind of problem does it look like to you J-D? What does this
>>>>> background task metaScanner do?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>> I didn't get most of the information I was looking for... but I do
>>>>>> have something to work with. So I asked whether the metaScanner was
>>>>>> making progress or not, and from that log I think I see it does:
>>>>>>
>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>>>> Rowscanned=261, rowsOffline=3
>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>>>> regions to be disabled from myTable
>>>>>>
>>>>>> Only 3 rows are offline, whereas we expect all the region to be
>>>>>> offline. I expect you see those lines a lot? If so, one thing I'd like
>>>>>> to see is the result of this command:
>>>>>>
>>>>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>>>>
>>>>>> It's going to be big. Then grep in the result for the string SPLIT,
>>>>>> and please post back here the lines that match.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> I issued two more disable commands and looked at the .META RS. The
>>>>>>> last log entry looks like:
>>>>>>>
>>>>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>>>>>>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>>>>>>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>>>>>>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>>>>
>>>>>>> I issued the command at: 00:34:59, the client debug message has
>>>>>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>>>>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>>>>>> Then all of a sudden it outputs the following:
>>>>>>>
>>>>>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>>>>>>> row <> in tableName .META.: location server 10.146.14.38:60020,
>>>>>>> location region name .META.,,1
>>>>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>>>>>> /hbase/root-region-server got 10.146.14.38:60020
>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>>>>>>> ROOT at 10.146.14.38:60020
>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>>>>>>> location for .META.,,1 is 10.146.14.38:60020
>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>>>> Rowscanned=261, rowsOffline=3
>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>>>> regions to be disabled from myTable
>>>>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>>>>>>> regions to be disabled from myTable
>>>>>>>
>>>>>>> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
>>>>>>> http://pastebin.com/Ywwx8afQ
>>>>>>>
>>>>>>> I hope I compiled some useful information here. I am pretty lost at
>>>>>>> this point, I would really appreciate your help!
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>> Ok so no debug from the disable command? BTW you should output your GC
>>>>>>>> somewhere else like the .out file or a completely separate file.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>> I see, so I should jstack the .META region. I'll do that.
>>>>>>>>>
>>>>>>>>> The master log pretty much looks like this: should I grep for
>>>>>>>>> something specific?
>>>>>>>>>
>>>>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>>>>>>> scan of 1 row(s) of meta region complete
>>>>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>>>>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>>>>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>> real=0.00 secs]
>>>>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>>>>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>>>>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>> real=0.00 secs]
>>>>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>>>>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>>>>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>>>>>>> real=0.01 secs]
>>>>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>>>>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>>>>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>> real=0.01 secs]
>>>>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>>>>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>>>>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>>>>>>> real=0.01 secs]
>>>>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>>>>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>>>>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>>>>>>> real=0.01 secs]
>>>>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>>>>>>> average load 760.2222222222222
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>>>>>>> making multiple calls to the .META. region?
>>>>>>>>>>
>>>>>>>>>> If former, then jstack the region server that hosts .META. and see
>>>>>>>>>> where it's blocked.
>>>>>>>>>>
>>>>>>>>>> if latter, then it means your .META. region is slow? Again, what's
>>>>>>>>>> going on on the RS that hosts .META.?
>>>>>>>>>>
>>>>>>>>>> Finally, what's the master's log like during that time?
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>> J-D,
>>>>>>>>>>>
>>>>>>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>>>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>>>>>> several datasets, each of them is in a separate table; Our data
>>>>>>>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>>>>>>>> lastest version should be used. This is how the data is loaded: for
>>>>>>>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>>>>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>>>>>>>> result some calls to load table and drop table would overlap. Please
>>>>>>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>>>>>>
>>>>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>>>>  State: BLOCKED
>>>>>>>>>>>  Blocked count: 3801
>>>>>>>>>>>  Waited count: 72719
>>>>>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>>>>  Stack:
>>>>>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>>>>  State: WAITING
>>>>>>>>>>>  Blocked count: 1526058
>>>>>>>>>>>  Waited count: 1488998
>>>>>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>>>>  Stack:
>>>>>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>>>>>> <jd...@apache.org> wrote:
>>>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>>>
>>>>>>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>>>>> this case, grep for the table name).
>>>>>>>>>>>>
>>>>>>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>>>>> closed).
>>>>>>>>>>>>
>>>>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>>>>>>
>>>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>>>>>>> is fixed in 0.90.1
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>>>>>> Thanks so much.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
Man, I really appreciate all the help you've provided so far. I guess
I'll keep digging. Would this meta scan cause a Get or Scan on user
tables to be very, very slow too? That was actually the problem I
first noticed: from time to time a client Get or Scan would be
extremely slow on my tables.

On Tue, Mar 1, 2011 at 10:38 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> The store file looks fairly new, maybe it was just major compacted?
> You could easily figure that out from the region server's log.
>
> Your problem is definitely a .META. region slowness. If scanning 8k
> rows takes 30 minutes, scanning and updating 260 rows at the same time
> probably isn't much faster.
>
> J-D
>
> On Tue, Mar 1, 2011 at 10:34 AM, Nanheng Wu <na...@gmail.com> wrote:
>> J-D,
>>
>> It looks like I don't have that many files:
>>
>> drwxr-xr-x   - root supergroup          0 2011-02-28 23:21
>> /hbase/.META./1028785192
>> -rw-r--r--   3 root supergroup        932 2011-02-03 19:23
>> /hbase/.META./1028785192/.regioninfo
>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>> /hbase/.META./1028785192/info
>> -rw-r--r--   3 root supergroup   11831330 2011-03-01 17:13
>> /hbase/.META./1028785192/info/6687606720393313750
>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>> /hbase/.META./compaction.dir
>>
>> So you think compaction on the .META. table on the RS could be the
>> problem? While the metaScanner is running I saw virtually no logs on
>> that server, though I did see compactions once the meta scan ended
>> and region closes started to happen. I wonder what the cluster is
>> doing while the meta scan is in progress.
>>
>> I would definitely upgrade in the near future, it's just that I am
>> afraid upgrading won't guarantee a fix if I don't understand the
>> nature of the problem more clearly. We also have another test cluster,
>> running the same versions of everything on cheaper hardware and it
>> doesn't have this problem.
>>
>> On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> That long .META. scan might be the culprit, good catch!
>>>
>>> If you run this from the hadoop folder:
>>>
>>> bin/hadoop dfs -lsr /hbase/.META.
>>>
>>> Do you see a ton of files? Let's define "ton" as more than 16 files.
>>> If so, and I see you have a very high number of regions, then you
>>> could be hitting this problem:
>>> https://issues.apache.org/jira/browse/HBASE-3499
>>>
>>> In short, every few .META. change will result in a flush and it could
>>> be that the region server hosting it isn't able to keep up with the
>>> compactions. It's done so that since append isn't supported in your
>>> hadoop that you won't lose too much data if the node dies, which is
>>> really bad when it happens to .META.
>>>
>>> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
>>> back to the default value if you're not writing hence not splitting.
>>> Running the script attached in that jira would do it (don't forget to
>>> restart HBase after running it).
>>>
>>> Lastly, upgrading to HBase 0.90.1 and a hadoop that supports append
>>> should be a priority.
>>>
>>> J-D
>>>
>>> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Hi J-D:
>>>>
>>>>  I did the scan like you suggested but no splits came up. This kind
>>>> of makes sense to me, since we write Hfiles using a MR job and then
>>>> pretty much use HBase for read-only so maybe that's why the regions
>>>> never got split? I think my slow disable problem is related to this
>>>> RegionManager.metaScanner. I greped for that in the Masters log and I
>>>> saw:
>>>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>> .META.,,1, startKey: <>}
>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>> regionname: .META.,,1, startKey: <>} complete
>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>> .META.,,1, startKey: <>}
>>>> Thread 27 (RegionManager.metaScanner):
>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>> regionname: .META.,,1, startKey: <>} complete
>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>> .META.,,1, startKey: <>}
>>>> Thread 27 (RegionManager.metaScanner):
>>>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
>>>> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
>>>> regionname: .META.,,1, startKey: <>} complete
>>>>
>>>> It looks like meta scanner is always running, each time taking 20-30
>>>> minutes. Here's what I did:
>>>>
>>>> 1) disable table -> at 16:28:31, according the log above there's a
>>>> meta scan in progress.
>>>>
>>>> 2 ) the command hangs, no output in .META RS's log
>>>> at 11/03/01 16:46:57 the client got 11/03/01 16:46:58 DEBUG
>>>> zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got
>>>> 10.146.14.38:60020
>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
>>>> ROOT at 10.146.14.38:60020
>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
>>>> location for .META.,,1 is 10.146.14.38:60020
>>>> This corresponds exact to when the meta scan completed. .META RS
>>>> started to spew some logs about closing regions.
>>>>
>>>> 3) Another meta scan started and the client hangs again until it
>>>> finishes, and disable command returns.The whole thing took 2688.5930
>>>> seconds
>>>>
>>>> What kind of problem does it look like to you J-D? What does this
>>>> background task metaScanner do?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>> I didn't get most of the information I was looking for... but I do
>>>>> have something to work with. So I asked whether the metaScanner was
>>>>> making progress or not, and from that log I think I see it does:
>>>>>
>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>>> Rowscanned=261, rowsOffline=3
>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>>> regions to be disabled from myTable
>>>>>
>>>>> Only 3 rows are offline, whereas we expect all the region to be
>>>>> offline. I expect you see those lines a lot? If so, one thing I'd like
>>>>> to see is the result of this command:
>>>>>
>>>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>>>
>>>>> It's going to be big. Then grep in the result for the string SPLIT,
>>>>> and please post back here the lines that match.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> I issued two more disable commands and looked at the .META RS. The
>>>>>> last log entry looks like:
>>>>>>
>>>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>>>>>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>>>>>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>>>>>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>>>
>>>>>> I issued the command at: 00:34:59, the client debug message has
>>>>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>>>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>>>>> Then all of a sudden it outputs the following:
>>>>>>
>>>>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>>>>>> row <> in tableName .META.: location server 10.146.14.38:60020,
>>>>>> location region name .META.,,1
>>>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>>>>> /hbase/root-region-server got 10.146.14.38:60020
>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>>>>>> ROOT at 10.146.14.38:60020
>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>>>>>> location for .META.,,1 is 10.146.14.38:60020
>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>>> Rowscanned=261, rowsOffline=3
>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>>> regions to be disabled from myTable
>>>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>>>>>> regions to be disabled from myTable
>>>>>>
>>>>>> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
>>>>>> http://pastebin.com/Ywwx8afQ
>>>>>>
>>>>>> I hope I compiled some useful information here. I am pretty lost at
>>>>>> this point, I would really appreciate your help!
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>> Ok so no debug from the disable command? BTW you should output your GC
>>>>>>> somewhere else like the .out file or a completely separate file.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>> I see, so I should jstack the .META region. I'll do that.
>>>>>>>>
>>>>>>>> The master log pretty much looks like this: should I grep for
>>>>>>>> something specific?
>>>>>>>>
>>>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>>>>>> scan of 1 row(s) of meta region complete
>>>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>>>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>>>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>>>>>> real=0.00 secs]
>>>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>>>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>>>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>>>>>> real=0.00 secs]
>>>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>>>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>>>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>>>>>> real=0.01 secs]
>>>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>>>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>>>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>>>>>> real=0.01 secs]
>>>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>>>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>>>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>>>>>> real=0.01 secs]
>>>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>>>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>>>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>>>>>> real=0.01 secs]
>>>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>>>>>> average load 760.2222222222222
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>>>>>> making multiple calls to the .META. region?
>>>>>>>>>
>>>>>>>>> If former, then jstack the region server that hosts .META. and see
>>>>>>>>> where it's blocked.
>>>>>>>>>
>>>>>>>>> if latter, then it means your .META. region is slow? Again, what's
>>>>>>>>> going on on the RS that hosts .META.?
>>>>>>>>>
>>>>>>>>> Finally, what's the master's log like during that time?
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>> J-D,
>>>>>>>>>>
>>>>>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>>>>> several datasets, each of them is in a separate table; Our data
>>>>>>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>>>>>>> lastest version should be used. This is how the data is loaded: for
>>>>>>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>>>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>>>>>>> result some calls to load table and drop table would overlap. Please
>>>>>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>>>>>
>>>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>>>  State: BLOCKED
>>>>>>>>>>  Blocked count: 3801
>>>>>>>>>>  Waited count: 72719
>>>>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>>>  Stack:
>>>>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>>>  State: WAITING
>>>>>>>>>>  Blocked count: 1526058
>>>>>>>>>>  Waited count: 1488998
>>>>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>>>  Stack:
>>>>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>>>>> <jd...@apache.org> wrote:
>>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>>
>>>>>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>>>> this case, grep for the table name).
>>>>>>>>>>>
>>>>>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>>>> closed).
>>>>>>>>>>>
>>>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>>>>>
>>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>>>>>> is fixed in 0.90.1
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>>>>> Thanks so much.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The store file looks fairly new, maybe it was just major compacted?
You could easily figure that out from the region server's log.

Your problem is definitely a .META. region slowness. If scanning 8k
rows takes 30 minutes, scanning and updating 260 rows at the same time
probably isn't much faster.
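
For scale, taking the metaScanner lines quoted further down: 8,296 rows
between 15:30:37 and 15:55:40 is about 25 minutes, or roughly 5-6 rows
a second, so even the 261-row check your client repeats while it waits
would take the better part of a minute per pass at that rate (rough
arithmetic from those numbers, not a measurement).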

J-D

On Tue, Mar 1, 2011 at 10:34 AM, Nanheng Wu <na...@gmail.com> wrote:
> J-D,
>
> It looks like I don't have that many files:
>
> drwxr-xr-x   - root supergroup          0 2011-02-28 23:21
> /hbase/.META./1028785192
> -rw-r--r--   3 root supergroup        932 2011-02-03 19:23
> /hbase/.META./1028785192/.regioninfo
> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
> /hbase/.META./1028785192/info
> -rw-r--r--   3 root supergroup   11831330 2011-03-01 17:13
> /hbase/.META./1028785192/info/6687606720393313750
> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
> /hbase/.META./compaction.dir
>
> So you think compaction on the .META. table on the RS could be the
> problem? While the metaScanner is running I saw virtually no logs on
> that server, though I did see compactions once the meta scan ended
> and region closes started to happen. I wonder what the cluster is
> doing while the meta scan is in progress.
>
> I would definitely upgrade in the near future, it's just that I am
> afraid upgrading won't guarantee a fix if I don't understand the
> nature of the problem more clearly. We also have another test cluster,
> running the same versions of everything on cheaper hardware and it
> doesn't have this problem.
>
> On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> That long .META. scan might be the culprit, good catch!
>>
>> If you run this from the hadoop folder:
>>
>> bin/hadoop dfs -lsr /hbase/.META.
>>
>> Do you see a ton of files? Let's define "ton" as more than 16 files.
>> If so, and I see you have a very high number of regions, then you
>> could be hitting this problem:
>> https://issues.apache.org/jira/browse/HBASE-3499
>>
>> In short, every few .META. change will result in a flush and it could
>> be that the region server hosting it isn't able to keep up with the
>> compactions. It's done so that since append isn't supported in your
>> hadoop that you won't lose too much data if the node dies, which is
>> really bad when it happens to .META.
>>
>> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
>> back to the default value if you're not writing hence not splitting.
>> Running the script attached in that jira would do it (don't forget to
>> restart HBase after running it).
>>
>> Lastly, upgrading to HBase 0.90.1 and a hadoop that supports append
>> should be a priority.
>>
>> J-D
>>
>> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Hi J-D:
>>>
>>>  I did the scan like you suggested but no splits came up. This kind
>>> of makes sense to me, since we write Hfiles using a MR job and then
>>> pretty much use HBase for read-only so maybe that's why the regions
>>> never got split? I think my slow disable problem is related to this
>>> RegionManager.metaScanner. I greped for that in the Masters log and I
>>> saw:
>>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>> .META.,,1, startKey: <>}
>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>> regionname: .META.,,1, startKey: <>} complete
>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>> .META.,,1, startKey: <>}
>>> Thread 27 (RegionManager.metaScanner):
>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>> regionname: .META.,,1, startKey: <>} complete
>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>> .META.,,1, startKey: <>}
>>> Thread 27 (RegionManager.metaScanner):
>>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
>>> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
>>> regionname: .META.,,1, startKey: <>} complete
>>>
>>> It looks like meta scanner is always running, each time taking 20-30
>>> minutes. Here's what I did:
>>>
>>> 1) disable table -> at 16:28:31, according the log above there's a
>>> meta scan in progress.
>>>
>>> 2 ) the command hangs, no output in .META RS's log
>>> at 11/03/01 16:46:57 the client got 11/03/01 16:46:58 DEBUG
>>> zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got
>>> 10.146.14.38:60020
>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
>>> ROOT at 10.146.14.38:60020
>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
>>> location for .META.,,1 is 10.146.14.38:60020
>>> This corresponds exact to when the meta scan completed. .META RS
>>> started to spew some logs about closing regions.
>>>
>>> 3) Another meta scan started and the client hangs again until it
>>> finishes, and disable command returns.The whole thing took 2688.5930
>>> seconds
>>>
>>> What kind of problem does it look like to you J-D? What does this
>>> background task metaScanner do?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> I didn't get most of the information I was looking for... but I do
>>>> have something to work with. So I asked whether the metaScanner was
>>>> making progress or not, and from that log I think I see it does:
>>>>
>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>> Rowscanned=261, rowsOffline=3
>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>> regions to be disabled from myTable
>>>>
>>>> Only 3 rows are offline, whereas we expect all the region to be
>>>> offline. I expect you see those lines a lot? If so, one thing I'd like
>>>> to see is the result of this command:
>>>>
>>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>>
>>>> It's going to be big. Then grep in the result for the string SPLIT,
>>>> and please post back here the lines that match.
>>>>
>>>> J-D
>>>>
>>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> I issued two more disable commands and looked at the .META RS. The
>>>>> last log entry looks like:
>>>>>
>>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>>>>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>>>>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>>>>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>>
>>>>> I issued the command at: 00:34:59, the client debug message has
>>>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>>>> Then all of a sudden it outputs the following:
>>>>>
>>>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>>>>> row <> in tableName .META.: location server 10.146.14.38:60020,
>>>>> location region name .META.,,1
>>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>>>> /hbase/root-region-server got 10.146.14.38:60020
>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>>>>> ROOT at 10.146.14.38:60020
>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>>>>> location for .META.,,1 is 10.146.14.38:60020
>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>>> Rowscanned=261, rowsOffline=3
>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>>> regions to be disabled from myTable
>>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>>>>> regions to be disabled from myTable
>>>>>
>>>>> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
>>>>> http://pastebin.com/Ywwx8afQ
>>>>>
>>>>> I hope I compiled some useful information here. I am pretty lost at
>>>>> this point, I would really appreciate your help!
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>> Ok so no debug from the disable command? BTW you should output your GC
>>>>>> somewhere else like the .out file or a completely separate file.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> I see, so I should jstack the .META region. I'll do that.
>>>>>>>
>>>>>>> The master log pretty much looks like this: should I grep for
>>>>>>> something specific?
>>>>>>>
>>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>>>>> scan of 1 row(s) of meta region complete
>>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>>>>> real=0.00 secs]
>>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>>>>> real=0.00 secs]
>>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>>>>> real=0.01 secs]
>>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>>>>> real=0.01 secs]
>>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>>>>> real=0.01 secs]
>>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>>>>> real=0.01 secs]
>>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>>>>> average load 760.2222222222222
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>>>>> making multiple calls to the .META. region?
>>>>>>>>
>>>>>>>> If former, then jstack the region server that hosts .META. and see
>>>>>>>> where it's blocked.
>>>>>>>>
>>>>>>>> if latter, then it means your .META. region is slow? Again, what's
>>>>>>>> going on on the RS that hosts .META.?
>>>>>>>>
>>>>>>>> Finally, what's the master's log like during that time?
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>>>> several datasets, each of them is in a separate table; Our data
>>>>>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>>>>>> lastest version should be used. This is how the data is loaded: for
>>>>>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>>>>>> result some calls to load table and drop table would overlap. Please
>>>>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>>>>
>>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>>  State: BLOCKED
>>>>>>>>>  Blocked count: 3801
>>>>>>>>>  Waited count: 72719
>>>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>>  Stack:
>>>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>>  State: WAITING
>>>>>>>>>  Blocked count: 1526058
>>>>>>>>>  Waited count: 1488998
>>>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>>  Stack:
>>>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>>>> <jd...@apache.org> wrote:
>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>
>>>>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>>> this case, grep for the table name).
>>>>>>>>>>
>>>>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>>> closed).
>>>>>>>>>>
>>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>>>>
>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>>>>> is fixed in 0.90.1
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>>>> Thanks so much.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
J-D,

It looks like I don't have that many files:

drwxr-xr-x   - root supergroup          0 2011-02-28 23:21
/hbase/.META./1028785192
-rw-r--r--   3 root supergroup        932 2011-02-03 19:23
/hbase/.META./1028785192/.regioninfo
drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
/hbase/.META./1028785192/info
-rw-r--r--   3 root supergroup   11831330 2011-03-01 17:13
/hbase/.META./1028785192/info/6687606720393313750
drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
/hbase/.META./compaction.dir

So you think compaction on the .META. table on that RS could be the
problem? While the metaScanner was running I saw virtually no logs on
that server, though; I only saw compactions once the meta scan ended
and the region closes started to happen. I wonder what the cluster is
doing while the meta scan is in progress.
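
(For what it's worth, this is roughly how I've been watching for those
compactions; just a sketch, with <META_RS_LOG_FILE> standing in for the
log of the RS that hosts .META. on my box:)

grep compaction <META_RS_LOG_FILE> | tail -20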

I would definitely upgrade in the near future; it's just that I am
afraid upgrading won't guarantee a fix if I don't understand the
nature of the problem more clearly. We also have another test cluster,
running the same versions of everything on cheaper hardware and it
doesn't have this problem.

On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> That long .META. scan might be the culprit, good catch!
>
> If you run this from the hadoop folder:
>
> bin/hadoop dfs -lsr /hbase/.META.
>
> Do you see a ton of files? Let's define "ton" as more than 16 files.
> If so, and I see you have a very high number of regions, then you
> could be hitting this problem:
> https://issues.apache.org/jira/browse/HBASE-3499
>
> In short, every few .META. change will result in a flush and it could
> be that the region server hosting it isn't able to keep up with the
> compactions. It's done so that since append isn't supported in your
> hadoop that you won't lose too much data if the node dies, which is
> really bad when it happens to .META.
>
> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
> back to the default value if you're not writing hence not splitting.
> Running the script attached in that jira would do it (don't forget to
> restart HBase after running it).
>
> Lastly, upgrading to HBase 0.90.1 and a hadoop that supports append
> should be a priority.
>
> J-D
>
> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Hi J-D:
>>
>>  I did the scan like you suggested but no splits came up. This kind
>> of makes sense to me, since we write Hfiles using a MR job and then
>> pretty much use HBase for read-only so maybe that's why the regions
>> never got split? I think my slow disable problem is related to this
>> RegionManager.metaScanner. I greped for that in the Masters log and I
>> saw:
>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
>> scanning meta region {server: 10.146.14.38:60020, regionname:
>> .META.,,1, startKey: <>}
>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>> regionname: .META.,,1, startKey: <>} complete
>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>> scanning meta region {server: 10.146.14.38:60020, regionname:
>> .META.,,1, startKey: <>}
>> Thread 27 (RegionManager.metaScanner):
>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>> regionname: .META.,,1, startKey: <>} complete
>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>> scanning meta region {server: 10.146.14.38:60020, regionname:
>> .META.,,1, startKey: <>}
>> Thread 27 (RegionManager.metaScanner):
>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
>> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
>> regionname: .META.,,1, startKey: <>} complete
>>
>> It looks like meta scanner is always running, each time taking 20-30
>> minutes. Here's what I did:
>>
>> 1) disable table -> at 16:28:31, according the log above there's a
>> meta scan in progress.
>>
>> 2 ) the command hangs, no output in .META RS's log
>> at 11/03/01 16:46:57 the client got 11/03/01 16:46:58 DEBUG
>> zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got
>> 10.146.14.38:60020
>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
>> ROOT at 10.146.14.38:60020
>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
>> location for .META.,,1 is 10.146.14.38:60020
>> This corresponds exact to when the meta scan completed. .META RS
>> started to spew some logs about closing regions.
>>
>> 3) Another meta scan started and the client hangs again until it
>> finishes, and disable command returns.The whole thing took 2688.5930
>> seconds
>>
>> What kind of problem does it look like to you J-D? What does this
>> background task metaScanner do?
>>
>> Thanks!
>>
>>
>>
>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> I didn't get most of the information I was looking for... but I do
>>> have something to work with. So I asked whether the metaScanner was
>>> making progress or not, and from that log I think I see it does:
>>>
>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>> Rowscanned=261, rowsOffline=3
>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>> regions to be disabled from myTable
>>>
>>> Only 3 rows are offline, whereas we expect all the region to be
>>> offline. I expect you see those lines a lot? If so, one thing I'd like
>>> to see is the result of this command:
>>>
>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>
>>> It's going to be big. Then grep in the result for the string SPLIT,
>>> and please post back here the lines that match.
>>>
>>> J-D
>>>
>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> I issued two more disable commands and looked at the .META RS. The
>>>> last log entry looks like:
>>>>
>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>>>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>>>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>>>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>
>>>> I issued the command at: 00:34:59, the client debug message has
>>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>>> Then all of a sudden it outputs the following:
>>>>
>>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>>>> row <> in tableName .META.: location server 10.146.14.38:60020,
>>>> location region name .META.,,1
>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>>> /hbase/root-region-server got 10.146.14.38:60020
>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>>>> ROOT at 10.146.14.38:60020
>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>>>> location for .META.,,1 is 10.146.14.38:60020
>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>>> Rowscanned=261, rowsOffline=3
>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>>> regions to be disabled from myTable
>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>>>> regions to be disabled from myTable
>>>>
>>>> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
>>>> http://pastebin.com/Ywwx8afQ
>>>>
>>>> I hope I compiled some useful information here. I am pretty lost at
>>>> this point, I would really appreciate your help!
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>> Ok so no debug from the disable command? BTW you should output your GC
>>>>> somewhere else like the .out file or a completely separate file.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> I see, so I should jstack the .META region. I'll do that.
>>>>>>
>>>>>> The master log pretty much looks like this: should I grep for
>>>>>> something specific?
>>>>>>
>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>>>> scan of 1 row(s) of meta region complete
>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>>>> real=0.00 secs]
>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>>>> real=0.00 secs]
>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>>>> real=0.01 secs]
>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>>>> real=0.01 secs]
>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>>>> real=0.01 secs]
>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>>>> real=0.01 secs]
>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>>>> average load 760.2222222222222
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>>>> making multiple calls to the .META. region?
>>>>>>>
>>>>>>> If former, then jstack the region server that hosts .META. and see
>>>>>>> where it's blocked.
>>>>>>>
>>>>>>> if latter, then it means your .META. region is slow? Again, what's
>>>>>>> going on on the RS that hosts .META.?
>>>>>>>
>>>>>>> Finally, what's the master's log like during that time?
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>> J-D,
>>>>>>>>
>>>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>>> several datasets, each of them is in a separate table; Our data
>>>>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>>>>> lastest version should be used. This is how the data is loaded: for
>>>>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>>>>> result some calls to load table and drop table would overlap. Please
>>>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>>>
>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>  State: BLOCKED
>>>>>>>>  Blocked count: 3801
>>>>>>>>  Waited count: 72719
>>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>  Stack:
>>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>
>>>>>>>>
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>  State: WAITING
>>>>>>>>  Blocked count: 1526058
>>>>>>>>  Waited count: 1488998
>>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>  Stack:
>>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>>> <jd...@apache.org> wrote:
>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>
>>>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>> this case, grep for the table name).
>>>>>>>>>
>>>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>> closed).
>>>>>>>>>
>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>>>
>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>>>> is fixed in 0.90.1
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>>> Thanks so much.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That long .META. scan might be the culprit, good catch!

If you run this from the hadoop folder:

bin/hadoop dfs -lsr /hbase/.META.

Do you see a ton of files? Let's define "ton" as more than 16 files.
If so, and I see you have a very high number of regions, then you
could be hitting this problem:
https://issues.apache.org/jira/browse/HBASE-3499
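
For a quick count, something along these lines should do it (just a
sketch, assuming the default /hbase root dir; directory entries in the
-lsr output start with a 'd'):

bin/hadoop dfs -lsr /hbase/.META. | grep -v '^d' | wc -l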

In short, every few .META. changes result in a flush, and it could
be that the region server hosting it isn't able to keep up with the
compactions. It's done that way because append isn't supported in your
Hadoop, so that you won't lose too much data if the node dies, which is
really bad when it happens to .META.

In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
back to the default value if you're not writing and hence not splitting.
Running the script attached to that jira would do it (don't forget to
restart HBase after running it).

Lastly, upgrading to HBase 0.90.1 and a Hadoop release that supports
append should be a priority.

J-D

On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <na...@gmail.com> wrote:
> Hi J-D:
>
>  I did the scan like you suggested but no splits came up. This kind
> of makes sense to me, since we write Hfiles using a MR job and then
> pretty much use HBase for read-only so maybe that's why the regions
> never got split? I think my slow disable problem is related to this
> RegionManager.metaScanner. I greped for that in the Masters log and I
> saw:
> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
> scanning meta region {server: 10.146.14.38:60020, regionname:
> .META.,,1, startKey: <>}
> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
> regionname: .META.,,1, startKey: <>} complete
> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
> scanning meta region {server: 10.146.14.38:60020, regionname:
> .META.,,1, startKey: <>}
> Thread 27 (RegionManager.metaScanner):
> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
> regionname: .META.,,1, startKey: <>} complete
> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
> scanning meta region {server: 10.146.14.38:60020, regionname:
> .META.,,1, startKey: <>}
> Thread 27 (RegionManager.metaScanner):
> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
> regionname: .META.,,1, startKey: <>} complete
>
> It looks like meta scanner is always running, each time taking 20-30
> minutes. Here's what I did:
>
> 1) disable table -> at 16:28:31, according the log above there's a
> meta scan in progress.
>
> 2 ) the command hangs, no output in .META RS's log
> at 11/03/01 16:46:57 the client got 11/03/01 16:46:58 DEBUG
> zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got
> 10.146.14.38:60020
> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
> ROOT at 10.146.14.38:60020
> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
> location for .META.,,1 is 10.146.14.38:60020
> This corresponds exact to when the meta scan completed. .META RS
> started to spew some logs about closing regions.
>
> 3) Another meta scan started and the client hangs again until it
> finishes, and disable command returns.The whole thing took 2688.5930
> seconds
>
> What kind of problem does it look like to you J-D? What does this
> background task metaScanner do?
>
> Thanks!
>
>
>
> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> I didn't get most of the information I was looking for... but I do
>> have something to work with. So I asked whether the metaScanner was
>> making progress or not, and from that log I think I see it does:
>>
>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>> Rowscanned=261, rowsOffline=3
>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>> regions to be disabled from myTable
>>
>> Only 3 rows are offline, whereas we expect all the region to be
>> offline. I expect you see those lines a lot? If so, one thing I'd like
>> to see is the result of this command:
>>
>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>
>> It's going to be big. Then grep in the result for the string SPLIT,
>> and please post back here the lines that match.
>>
>> J-D
>>
>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> I issued two more disable commands and looked at the .META RS. The
>>> last log entry looks like:
>>>
>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>
>>> I issued the command at: 00:34:59, the client debug message has
>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>> Then all of a sudden it outputs the following:
>>>
>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>>> row <> in tableName .META.: location server 10.146.14.38:60020,
>>> location region name .META.,,1
>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>>> /hbase/root-region-server got 10.146.14.38:60020
>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>>> ROOT at 10.146.14.38:60020
>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>>> location for .META.,,1 is 10.146.14.38:60020
>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>>> Rowscanned=261, rowsOffline=3
>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>>> regions to be disabled from myTable
>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>>> regions to be disabled from myTable
>>>
>>> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
>>> http://pastebin.com/Ywwx8afQ
>>>
>>> I hope I compiled some useful information here. I am pretty lost at
>>> this point, I would really appreciate your help!
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> Ok so no debug from the disable command? BTW you should output your GC
>>>> somewhere else like the .out file or a completely separate file.
>>>>
>>>> J-D
>>>>
>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> I see, so I should jstack the .META region. I'll do that.
>>>>>
>>>>> The master log pretty much looks like this: should I grep for
>>>>> something specific?
>>>>>
>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>>> scan of 1 row(s) of meta region complete
>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>>> real=0.00 secs]
>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>>> real=0.00 secs]
>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>>> real=0.01 secs]
>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>>> real=0.01 secs]
>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>>> real=0.01 secs]
>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>>> real=0.01 secs]
>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>>> average load 760.2222222222222
>>>>>
>>>>>
>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>>> making multiple calls to the .META. region?
>>>>>>
>>>>>> If former, then jstack the region server that hosts .META. and see
>>>>>> where it's blocked.
>>>>>>
>>>>>> if latter, then it means your .META. region is slow? Again, what's
>>>>>> going on on the RS that hosts .META.?
>>>>>>
>>>>>> Finally, what's the master's log like during that time?
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> J-D,
>>>>>>>
>>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>> several datasets, each of them is in a separate table; Our data
>>>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>>>> lastest version should be used. This is how the data is loaded: for
>>>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>>>> result some calls to load table and drop table would overlap. Please
>>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>>> Thanks!
>>>>>>>
>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>>
>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>  State: BLOCKED
>>>>>>>  Blocked count: 3801
>>>>>>>  Waited count: 72719
>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>  Stack:
>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>
>>>>>>>
>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>  State: WAITING
>>>>>>>  Blocked count: 1526058
>>>>>>>  Waited count: 1488998
>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>  Stack:
>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>>> <jd...@apache.org> wrote:
>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>
>>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>> this case, grep for the table name).
>>>>>>>>
>>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>> closed).
>>>>>>>>
>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>>
>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>>> is fixed in 0.90.1
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>>> have some idea of what I should investigate?
>>>>>>>>> Thanks so much.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
Hi J-D:

  I did the scan like you suggested but no splits came up. This kind
of makes sense to me: we write HFiles with a MR job and then use HBase
pretty much read-only, so maybe that's why the regions never got split?
I think my slow disable problem is related to this
RegionManager.metaScanner. I grepped for it in the master's log and
saw:
11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
scanning meta region {server: 10.146.14.38:60020, regionname:
.META.,,1, startKey: <>}
11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
regionname: .META.,,1, startKey: <>} complete
11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
scanning meta region {server: 10.146.14.38:60020, regionname:
.META.,,1, startKey: <>}
Thread 27 (RegionManager.metaScanner):
11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
regionname: .META.,,1, startKey: <>} complete
11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
scanning meta region {server: 10.146.14.38:60020, regionname:
.META.,,1, startKey: <>}
Thread 27 (RegionManager.metaScanner):
11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
regionname: .META.,,1, startKey: <>} complete

It looks like the meta scanner is always running, each pass taking 20-30
minutes. Here's what I did:

1) disable table -> at 16:28:31; according to the log above, a meta
scan is already in progress.

2) The command hangs, with no output in the .META. RS's log, until
around 11/03/01 16:46:57, when the client gets:
11/03/01 16:46:58 DEBUG
zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got
10.146.14.38:60020
11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found
ROOT at 10.146.14.38:60020
11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached
location for .META.,,1 is 10.146.14.38:60020
This corresponds exactly to when the meta scan completed; the .META. RS
then started to emit logs about closing regions.

3) Another meta scan starts and the client hangs again until it
finishes; then the disable command returns. The whole thing took
2688.5930 seconds.
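
(By the way, this is roughly how I pulled those metaScanner timings; a
sketch, with <MASTER_LOG_FILE> standing in for my master's log file:)

grep 'RegionManager.metaScanner' <MASTER_LOG_FILE>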

What kind of problem does this look like to you, J-D? What does this
background metaScanner task do?

Thanks!



On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> I didn't get most of the information I was looking for... but I do
> have something to work with. So I asked whether the metaScanner was
> making progress or not, and from that log I think I see it does:
>
>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>> Rowscanned=261, rowsOffline=3
>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>> regions to be disabled from myTable
>
> Only 3 rows are offline, whereas we expect all the region to be
> offline. I expect you see those lines a lot? If so, one thing I'd like
> to see is the result of this command:
>
> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>
> It's going to be big. Then grep in the result for the string SPLIT,
> and please post back here the lines that match.
>
> J-D
>
> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
>> I issued two more disable commands and looked at the .META RS. The
>> last log entry looks like:
>>
>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
>> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
>> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
>> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>
>> I issued the command at: 00:34:59, the client debug message has
>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>> Then all of a sudden it outputs the following:
>>
>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
>> row <> in tableName .META.: location server 10.146.14.38:60020,
>> location region name .META.,,1
>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
>> /hbase/root-region-server got 10.146.14.38:60020
>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
>> ROOT at 10.146.14.38:60020
>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
>> location for .META.,,1 is 10.146.14.38:60020
>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
>> Rowscanned=261, rowsOffline=3
>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
>> regions to be disabled from myTable
>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
>> regions to be disabled from myTable
>>
>> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
>> http://pastebin.com/Ywwx8afQ
>>
>> I hope I compiled some useful information here. I am pretty lost at
>> this point, I would really appreciate your help!
>>
>> Thanks!
>>
>>
>>
>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> Ok so no debug from the disable command? BTW you should output your GC
>>> somewhere else like the .out file or a completely separate file.
>>>
>>> J-D
>>>
>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> I see, so I should jstack the .META region. I'll do that.
>>>>
>>>> The master log pretty much looks like this: should I grep for
>>>> something specific?
>>>>
>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>>> scan of 1 row(s) of meta region complete
>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>>> real=0.00 secs]
>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>>> real=0.00 secs]
>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>>> real=0.01 secs]
>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>>> real=0.01 secs]
>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>>> real=0.01 secs]
>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>>> real=0.01 secs]
>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>>> average load 760.2222222222222
>>>>
>>>>
>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>>> making multiple calls to the .META. region?
>>>>>
>>>>> If former, then jstack the region server that hosts .META. and see
>>>>> where it's blocked.
>>>>>
>>>>> if latter, then it means your .META. region is slow? Again, what's
>>>>> going on on the RS that hosts .META.?
>>>>>
>>>>> Finally, what's the master's log like during that time?
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>> moment there were no other requests to the server at all. How should I
>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>> several datasets, each of them is in a separate table; Our data
>>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>>> lastest version should be used. This is how the data is loaded: for
>>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>>> result some calls to load table and drop table would overlap. Please
>>>>>> let me know if something stands out to you as a potential culprit.
>>>>>> Thanks!
>>>>>>
>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>>
>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>  State: BLOCKED
>>>>>>  Blocked count: 3801
>>>>>>  Waited count: 72719
>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>  Stack:
>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>
>>>>>>
>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>  State: WAITING
>>>>>>  Blocked count: 1526058
>>>>>>  Waited count: 1488998
>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>  Stack:
>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>>> <jd...@apache.org> wrote:
>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>
>>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>> this case, grep for the table name).
>>>>>>>
>>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>> closed).
>>>>>>>
>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>>
>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>>> is fixed in 0.90.1
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>>> have some idea of what I should investigate?
>>>>>>>> Thanks so much.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I didn't get most of the information I was looking for... but I do
have something to work with. So I asked whether the metaScanner was
making progress or not, and from that log I think I see it does:

> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
> Rowscanned=261, rowsOffline=3
> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
> regions to be disabled from myTable

Only 3 rows are offline, whereas we expect all the regions to be
offline. I expect you see those lines a lot? If so, one thing I'd like
to see is the result of this command:

scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}

It's going to be big. Then grep in the result for the string SPLIT,
and please post back here the lines that match.
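
If it's easier, the scan and the grep can be chained from the command
line, roughly like this (just a sketch; adjust the table name and limit
to your case):

echo "scan '.META.', {STARTROW => 'myTable,,', LIMIT => 261}" | bin/hbase shell | grep SPLIT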

J-D

On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <na...@gmail.com> wrote:
> I issued two more disable commands and looked at the .META RS. The
> last log entry looks like:
>
> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>
> I issued the command at: 00:34:59, the client debug message has
> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
> Then all of a sudden it outputs the following:
>
> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location server 10.146.14.38:60020,
> location region name .META.,,1
> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
> /hbase/root-region-server got 10.146.14.38:60020
> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
> ROOT at 10.146.14.38:60020
> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
> location for .META.,,1 is 10.146.14.38:60020
> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
> Rowscanned=261, rowsOffline=3
> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
> regions to be disabled from myTable
> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
> regions to be disabled from myTable
>
> Here's the full thread dump of the .META RS from 00:34:59 to  00:47:31
> http://pastebin.com/Ywwx8afQ
>
> I hope I compiled some useful information here. I am pretty lost at
> this point, I would really appreciate your help!
>
> Thanks!
>
>
>
> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> Ok so no debug from the disable command? BTW you should output your GC
>> somewhere else like the .out file or a completely separate file.
>>
>> J-D
>>
>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> I see, so I should jstack the .META region. I'll do that.
>>>
>>> The master log pretty much looks like this: should I grep for
>>> something specific?
>>>
>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>> scan of 1 row(s) of meta region complete
>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>>> real=0.00 secs]
>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>>> real=0.00 secs]
>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>>> real=0.01 secs]
>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>>> real=0.01 secs]
>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>>> real=0.01 secs]
>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>>> real=0.01 secs]
>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>> average load 760.2222222222222
>>>
>>>
>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> Are you able to tell if that call in metaScanner is hanging or it's
>>>> making multiple calls to the .META. region?
>>>>
>>>> If former, then jstack the region server that hosts .META. and see
>>>> where it's blocked.
>>>>
>>>> if latter, then it means your .META. region is slow? Again, what's
>>>> going on on the RS that hosts .META.?
>>>>
>>>> Finally, what's the master's log like during that time?
>>>>
>>>> J-D
>>>>
>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> J-D,
>>>>>
>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>> rather small tables and they got stuck for a long time again, so I
>>>>> took jstack of the master. From what I can tell, all disableTable
>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>> moment there were no other requests to the server at all. How should I
>>>>> investigate this further? If it helps, here's some background: I have
>>>>> several datasets, each of them is in a separate table; Our data
>>>>> pipeline produces a new version of each dataset everyday and only the
>>>>> lastest version should be used. This is how the data is loaded: for
>>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>>> to create a new table 3) disable and drop the previous version. As a
>>>>> result some calls to load table and drop table would overlap. Please
>>>>> let me know if something stands out to you as a potential culprit.
>>>>> Thanks!
>>>>>
>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>
>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>  State: BLOCKED
>>>>>  Blocked count: 3801
>>>>>  Waited count: 72719
>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>  Stack:
>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>
>>>>>
>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>  State: WAITING
>>>>>  Blocked count: 1526058
>>>>>  Waited count: 1488998
>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>  Stack:
>>>>>    java.lang.Object.wait(Native Method)
>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>    $Proxy1.get(Unknown Source)
>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>> <jd...@apache.org> wrote:
>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>
>>>>>> I would approach this problem like I generally do with HBase issue,
>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>> this case, grep for the table name).
>>>>>>
>>>>>> Then I would look the region server log(s) of the nodes that were
>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>> disable the regions (starting to close, flush, region completely
>>>>>> closed).
>>>>>>
>>>>>> If you are able to do it while it's taking a very long time to
>>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>>
>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>>> is fixed in 0.90.1
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>> disabling function, I ran a disable command on a very small table
>>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>>> have some idea of what I should investigate?
>>>>>>> Thanks so much.
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
I issued two more disable commands and looked at the .META. RS. The
last log entry looks like:

11/03/01 00:29:14 INFO regionserver.HLog: Roll
/hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
.dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072

I issued the command at 00:34:59; the client debug output showed
nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
Then all of a sudden it output the following:

11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
row <> in tableName .META.: location server 10.146.14.38:60020,
location region name .META.,,1
11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
/hbase/root-region-server got 10.146.14.38:60020
11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
ROOT at 10.146.14.38:60020
11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
location for .META.,,1 is 10.146.14.38:60020
11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
Rowscanned=261, rowsOffline=3
11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
regions to be disabled from myTable
11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
regions to be disabled from myTable

Here's the full thread dump of the .META. RS from 00:34:59 to 00:47:31:
http://pastebin.com/Ywwx8afQ
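
(In case it matters, I grabbed the dumps with a loop along these lines;
a rough sketch, where <RS_PID> is the Java pid of the region server
hosting .META.:)

while true; do date; jstack <RS_PID>; sleep 30; done >> meta-rs-jstack.txt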

I hope I compiled some useful information here. I am pretty lost at
this point, so I would really appreciate your help!

Thanks!



On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Ok so no debug from the disable command? BTW you should output your GC
> somewhere else like the .out file or a completely separate file.
>
> J-D
>
> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>> I see, so I should jstack the .META region. I'll do that.
>>
>> The master log pretty much looks like this: should I grep for
>> something specific?
>>
>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>> scan of 1 row(s) of meta region complete
>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
>> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
>> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
>> real=0.00 secs]
>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
>> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
>> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
>> real=0.00 secs]
>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
>> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
>> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
>> real=0.01 secs]
>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
>> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
>> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
>> real=0.01 secs]
>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
>> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
>> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
>> real=0.01 secs]
>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
>> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
>> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
>> real=0.01 secs]
>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>> average load 760.2222222222222
>>
>>
>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> Are you able to tell if that call in metaScanner is hanging or it's
>>> making multiple calls to the .META. region?
>>>
>>> If former, then jstack the region server that hosts .META. and see
>>> where it's blocked.
>>>
>>> if latter, then it means your .META. region is slow? Again, what's
>>> going on on the RS that hosts .META.?
>>>
>>> Finally, what's the master's log like during that time?
>>>
>>> J-D
>>>
>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> J-D,
>>>>
>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>> rather small tables and they got stuck for a long time again, so I
>>>> took jstack of the master. From what I can tell, all disableTable
>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>> moment there were no other requests to the server at all. How should I
>>>> investigate this further? If it helps, here's some background: I have
>>>> several datasets, each of them is in a separate table; Our data
>>>> pipeline produces a new version of each dataset everyday and only the
>>>> lastest version should be used. This is how the data is loaded: for
>>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>>> to create a new table 3) disable and drop the previous version. As a
>>>> result some calls to load table and drop table would overlap. Please
>>>> let me know if something stands out to you as a potential culprit.
>>>> Thanks!
>>>>
>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>
>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>  State: BLOCKED
>>>>  Blocked count: 3801
>>>>  Waited count: 72719
>>>>  Blocked on java.lang.Object@75ac522c
>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>  Stack:
>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>
>>>>
>>>> Thread 27 (RegionManager.metaScanner):
>>>>  State: WAITING
>>>>  Blocked count: 1526058
>>>>  Waited count: 1488998
>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>  Stack:
>>>>    java.lang.Object.wait(Native Method)
>>>>    java.lang.Object.wait(Object.java:485)
>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>    $Proxy1.get(Unknown Source)
>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>> <jd...@apache.org> wrote:
>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>
>>>>> I would approach this problem like I generally do with HBase issue,
>>>>> first check the master log for any weirdness regarding my problem (in
>>>>> this case, grep for the table name).
>>>>>
>>>>> Then I would look the region server log(s) of the nodes that were
>>>>> hosting regions from that table. You should see the steps taken to
>>>>> disable the regions (starting to close, flush, region completely
>>>>> closed).
>>>>>
>>>>> If you are able to do it while it's taking a very long time to
>>>>> disable, try to jstack the process the seems to be hanging.
>>>>>
>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>> almost prevent disabling a table (or re-enabling) if any region
>>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>>> is fixed in 0.90.1
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>> the time. In the short term I am having a really hard time with the
>>>>>> disabling function, I ran a disable command on a very small table
>>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>> web UI only the region server carrying the META table has non-zero
>>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>>> they should get some request to flush the memstore at least. I *am*
>>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>>> shows the memory usage is almost full on the META region. Would you
>>>>>> have some idea of what I should investigate?
>>>>>> Thanks so much.
>>>>>
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
OK, so no debug output from the disable command? BTW, you should send
your GC logging somewhere else, like the .out file or a completely
separate file.

J-D

On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <na...@gmail.com> wrote:
> I see, so I should jstack the .META region. I'll do that.
>
> The master log pretty much looks like this: should I grep for
> something specific?
>
> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
> scan of 1 row(s) of meta region complete
> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew:
> 19136K->2112K(19136K), 0.0033350 secs] 378606K->36159
> 6K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00,
> real=0.00 secs]
> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew:
> 19068K->2112K(19136K), 0.0029690 secs] 378552K->36163
> 0K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00,
> real=0.00 secs]
> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew:
> 19136K->2112K(19136K), 0.0037690 secs] 378654K->36164
> 7K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew:
> 19136K->1810K(19136K), 0.0032600 secs] 378671K->36134
> 6K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00,
> real=0.01 secs]
> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew:
> 18834K->2112K(19136K), 0.0029870 secs] 378370K->36166
> 1K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew:
> 19136K->1747K(19136K), 0.0037090 secs] 378685K->36129
> 8K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00,
> real=0.01 secs]
> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
> average load 760.2222222222222
>
>
> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> Are you able to tell if that call in metaScanner is hanging or it's
>> making multiple calls to the .META. region?
>>
>> If former, then jstack the region server that hosts .META. and see
>> where it's blocked.
>>
>> if latter, then it means your .META. region is slow? Again, what's
>> going on on the RS that hosts .META.?
>>
>> Finally, what's the master's log like during that time?
>>
>> J-D
>>
>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> J-D,
>>>
>>>  Thanks so much for your help so far! I sent disable commands on 4
>>> rather small tables and they got stuck for a long time again, so I
>>> took jstack of the master. From what I can tell, all disableTable
>>> calls are blocked by a meta scanner thread (sample log below). At the
>>> moment there were no other requests to the server at all. How should I
>>> investigate this further? If it helps, here's some background: I have
>>> several datasets, each of them is in a separate table; Our data
>>> pipeline produces a new version of each dataset everyday and only the
>>> lastest version should be used. This is how the data is loaded: for
>>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>>> to create a new table 3) disable and drop the previous version. As a
>>> result some calls to load table and drop table would overlap. Please
>>> let me know if something stands out to you as a potential culprit.
>>> Thanks!
>>>
>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>
>>> Thread 47 (IPC Server handler 13 on 60000):
>>>  State: BLOCKED
>>>  Blocked count: 3801
>>>  Waited count: 72719
>>>  Blocked on java.lang.Object@75ac522c
>>>  Blocked by 27 (RegionManager.metaScanner)
>>>  Stack:
>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>
>>>
>>> Thread 27 (RegionManager.metaScanner):
>>>  State: WAITING
>>>  Blocked count: 1526058
>>>  Waited count: 1488998
>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>  Stack:
>>>    java.lang.Object.wait(Native Method)
>>>    java.lang.Object.wait(Object.java:485)
>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>    $Proxy1.get(Unknown Source)
>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>
>>>
>>>
>>>
>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>> <jd...@apache.org> wrote:
>>>> An hour to disable? That doesn't sound right at all :)
>>>>
>>>> I would approach this problem like I generally do with HBase issue,
>>>> first check the master log for any weirdness regarding my problem (in
>>>> this case, grep for the table name).
>>>>
>>>> Then I would look the region server log(s) of the nodes that were
>>>> hosting regions from that table. You should see the steps taken to
>>>> disable the regions (starting to close, flush, region completely
>>>> closed).
>>>>
>>>> If you are able to do it while it's taking a very long time to
>>>> disable, try to jstack the process the seems to be hanging.
>>>>
>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>> almost prevent disabling a table (or re-enabling) if any region
>>>> recently split and the parent wasn't cleaned yet from .META. and that
>>>> is fixed in 0.90.1
>>>>
>>>> J-D
>>>>
>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>> system so that it doesn't need to create new and delete old tables all
>>>>> the time. In the short term I am having a really hard time with the
>>>>> disabling function, I ran a disable command on a very small table
>>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>> web UI only the region server carrying the META table has non-zero
>>>>> requests, all other RS have 0 requests the entire time. I would think
>>>>> they should get some request to flush the memstore at least. I *am*
>>>>> using the same RS nodes for some map reduce job at the time and top
>>>>> shows the memory usage is almost full on the META region. Would you
>>>>> have some idea of what I should investigate?
>>>>> Thanks so much.
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
I see, so I should jstack the RS hosting the .META. region. I'll do that.

The master log pretty much looks like this: should I grep for
something specific?

11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region complete
2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew: 19136K->2112K(19136K), 0.0033350 secs] 378606K->361596K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew: 19068K->2112K(19136K), 0.0029690 secs] 378552K->361630K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew: 19136K->2112K(19136K), 0.0037690 secs] 378654K->361647K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew: 19136K->1810K(19136K), 0.0032600 secs] 378671K->361346K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew: 18834K->2112K(19136K), 0.0029870 secs] 378370K->361661K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew: 19136K->1747K(19136K), 0.0037090 secs] 378685K->361298K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead, average load 760.2222222222222


On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Are you able to tell if that call in metaScanner is hanging or it's
> making multiple calls to the .META. region?
>
> If former, then jstack the region server that hosts .META. and see
> where it's blocked.
>
> if latter, then it means your .META. region is slow? Again, what's
> going on on the RS that hosts .META.?
>
> Finally, what's the master's log like during that time?
>
> J-D
>
> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
>> J-D,
>>
>>  Thanks so much for your help so far! I sent disable commands on 4
>> rather small tables and they got stuck for a long time again, so I
>> took jstack of the master. From what I can tell, all disableTable
>> calls are blocked by a meta scanner thread (sample log below). At the
>> moment there were no other requests to the server at all. How should I
>> investigate this further? If it helps, here's some background: I have
>> several datasets, each of them is in a separate table; Our data
>> pipeline produces a new version of each dataset everyday and only the
>> lastest version should be used. This is how the data is loaded: for
>> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
>> to create a new table 3) disable and drop the previous version. As a
>> result some calls to load table and drop table would overlap. Please
>> let me know if something stands out to you as a potential culprit.
>> Thanks!
>>
>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>
>> Thread 47 (IPC Server handler 13 on 60000):
>>  State: BLOCKED
>>  Blocked count: 3801
>>  Waited count: 72719
>>  Blocked on java.lang.Object@75ac522c
>>  Blocked by 27 (RegionManager.metaScanner)
>>  Stack:
>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>    java.lang.reflect.Method.invoke(Method.java:597)
>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>
>>
>> Thread 27 (RegionManager.metaScanner):
>>  State: WAITING
>>  Blocked count: 1526058
>>  Waited count: 1488998
>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>  Stack:
>>    java.lang.Object.wait(Native Method)
>>    java.lang.Object.wait(Object.java:485)
>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>    $Proxy1.get(Unknown Source)
>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>
>>
>>
>>
>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>> <jd...@apache.org> wrote:
>>> An hour to disable? That doesn't sound right at all :)
>>>
>>> I would approach this problem like I generally do with HBase issue,
>>> first check the master log for any weirdness regarding my problem (in
>>> this case, grep for the table name).
>>>
>>> Then I would look the region server log(s) of the nodes that were
>>> hosting regions from that table. You should see the steps taken to
>>> disable the regions (starting to close, flush, region completely
>>> closed).
>>>
>>> If you are able to do it while it's taking a very long time to
>>> disable, try to jstack the process the seems to be hanging.
>>>
>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>> almost prevent disabling a table (or re-enabling) if any region
>>> recently split and the parent wasn't cleaned yet from .META. and that
>>> is fixed in 0.90.1
>>>
>>> J-D
>>>
>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> I think you are right, maybe in the long run I need to re-architect my
>>>> system so that it doesn't need to create new and delete old tables all
>>>> the time. In the short term I am having a really hard time with the
>>>> disabling function, I ran a disable command on a very small table
>>>> (probably dozen of MBs in size) and are no client using the cluster at
>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>> web UI only the region server carrying the META table has non-zero
>>>> requests, all other RS have 0 requests the entire time. I would think
>>>> they should get some request to flush the memstore at least. I *am*
>>>> using the same RS nodes for some map reduce job at the time and top
>>>> shows the memory usage is almost full on the META region. Would you
>>>> have some idea of what I should investigate?
>>>> Thanks so much.
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Are you able to tell whether that call in metaScanner is hanging, or
whether it's making multiple calls to the .META. region?

If the former, then jstack the region server that hosts .META. and
see where it's blocked.

If the latter, then it means your .META. region is slow. Again,
what's going on on the RS that hosts .META.?

Finally, what's the master's log like during that time?

J-D

On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <na...@gmail.com> wrote:
> J-D,
>
>  Thanks so much for your help so far! I sent disable commands on 4
> rather small tables and they got stuck for a long time again, so I
> took jstack of the master. From what I can tell, all disableTable
> calls are blocked by a meta scanner thread (sample log below). At the
> moment there were no other requests to the server at all. How should I
> investigate this further? If it helps, here's some background: I have
> several datasets, each of them is in a separate table; Our data
> pipeline produces a new version of each dataset everyday and only the
> lastest version should be used. This is how the data is loaded: for
> each dataset 1) run a MR job and outputs HFiles 2) call loadTable.rb
> to create a new table 3) disable and drop the previous version. As a
> result some calls to load table and drop table would overlap. Please
> let me know if something stands out to you as a potential culprit.
> Thanks!
>
> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>
> Thread 47 (IPC Server handler 13 on 60000):
>  State: BLOCKED
>  Blocked count: 3801
>  Waited count: 72719
>  Blocked on java.lang.Object@75ac522c
>  Blocked by 27 (RegionManager.metaScanner)
>  Stack:
>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    java.lang.reflect.Method.invoke(Method.java:597)
>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
>
> Thread 27 (RegionManager.metaScanner):
>  State: WAITING
>  Blocked count: 1526058
>  Waited count: 1488998
>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>  Stack:
>    java.lang.Object.wait(Native Method)
>    java.lang.Object.wait(Object.java:485)
>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>    $Proxy1.get(Unknown Source)
>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>
>
>
>
> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
> <jd...@apache.org> wrote:
>> An hour to disable? That doesn't sound right at all :)
>>
>> I would approach this problem like I generally do with HBase issue,
>> first check the master log for any weirdness regarding my problem (in
>> this case, grep for the table name).
>>
>> Then I would look the region server log(s) of the nodes that were
>> hosting regions from that table. You should see the steps taken to
>> disable the regions (starting to close, flush, region completely
>> closed).
>>
>> If you are able to do it while it's taking a very long time to
>> disable, try to jstack the process the seems to be hanging.
>>
>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>> almost prevent disabling a table (or re-enabling) if any region
>> recently split and the parent wasn't cleaned yet from .META. and that
>> is fixed in 0.90.1
>>
>> J-D
>>
>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> I think you are right, maybe in the long run I need to re-architect my
>>> system so that it doesn't need to create new and delete old tables all
>>> the time. In the short term I am having a really hard time with the
>>> disabling function, I ran a disable command on a very small table
>>> (probably dozen of MBs in size) and are no client using the cluster at
>>> all, and that took about 1 hour to complete! The weird thing is on the
>>> web UI only the region server carrying the META table has non-zero
>>> requests, all other RS have 0 requests the entire time. I would think
>>> they should get some request to flush the memstore at least. I *am*
>>> using the same RS nodes for some map reduce job at the time and top
>>> shows the memory usage is almost full on the META region. Would you
>>> have some idea of what I should investigate?
>>> Thanks so much.
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
J-D,

Thanks so much for your help so far! I sent disable commands on 4
rather small tables and they got stuck for a long time again, so I
took a jstack of the master. From what I can tell, all disableTable
calls are blocked by a meta scanner thread (sample thread dump
below). At the moment there were no other requests to the server at
all. How should I investigate this further? If it helps, here's some
background: I have several datasets, each in a separate table. Our
data pipeline produces a new version of each dataset every day, and
only the latest version should be used. This is how the data is
loaded: for each dataset we 1) run an MR job that outputs HFiles,
2) call loadTable.rb to create a new table, and 3) disable and drop
the previous version. As a result, some calls to load a table and
drop a table can overlap. Please let me know if something stands out
to you as a potential culprit. Thanks!

BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
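
Step 3 of that pipeline boils down to something like the following
(a minimal sketch with a placeholder table name, using plain 0.20
HBaseAdmin calls; not our exact code, just the shape of it):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DropOldVersion {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Placeholder name for yesterday's version of a dataset.
    String oldTable = "myDataset_20110227";

    // A table has to be disabled before it can be dropped; the
    // disableTable call below is the one that hangs for us.
    if (admin.tableExists(oldTable)) {
      admin.disableTable(oldTable);
      admin.deleteTable(oldTable);
    }
  }
}

And here is the thread dump I mentioned: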

Thread 47 (IPC Server handler 13 on 60000):
  State: BLOCKED
  Blocked count: 3801
  Waited count: 72719
  Blocked on java.lang.Object@75ac522c
  Blocked by 27 (RegionManager.metaScanner)
  Stack:
    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)


Thread 27 (RegionManager.metaScanner):
  State: WAITING
  Blocked count: 1526058
  Waited count: 1488998
  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
    $Proxy1.get(Unknown Source)
    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
    org.apache.hadoop.hbase.Chore.run(Chore.java:68)




On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
<jd...@apache.org> wrote:
> An hour to disable? That doesn't sound right at all :)
>
> I would approach this problem like I generally do with HBase issue,
> first check the master log for any weirdness regarding my problem (in
> this case, grep for the table name).
>
> Then I would look the region server log(s) of the nodes that were
> hosting regions from that table. You should see the steps taken to
> disable the regions (starting to close, flush, region completely
> closed).
>
> If you are able to do it while it's taking a very long time to
> disable, try to jstack the process the seems to be hanging.
>
> Finally, like I said in another thread, there's a bug in 0.20.6 that
> almost prevent disabling a table (or re-enabling) if any region
> recently split and the parent wasn't cleaned yet from .META. and that
> is fixed in 0.90.1
>
> J-D
>
> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
>> I think you are right, maybe in the long run I need to re-architect my
>> system so that it doesn't need to create new and delete old tables all
>> the time. In the short term I am having a really hard time with the
>> disabling function, I ran a disable command on a very small table
>> (probably dozen of MBs in size) and are no client using the cluster at
>> all, and that took about 1 hour to complete! The weird thing is on the
>> web UI only the region server carrying the META table has non-zero
>> requests, all other RS have 0 requests the entire time. I would think
>> they should get some request to flush the memstore at least. I *am*
>> using the same RS nodes for some map reduce job at the time and top
>> shows the memory usage is almost full on the META region. Would you
>> have some idea of what I should investigate?
>> Thanks so much.
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
An hour to disable? That doesn't sound right at all :)

I would approach this problem like I generally do with HBase issues:
first check the master log for any weirdness regarding my problem (in
this case, grep for the table name).

Then I would look at the region server log(s) of the nodes that were
hosting regions from that table. You should see the steps taken to
disable the regions (starting to close, flush, region completely
closed).

If you are able to do it while it's taking a very long time to
disable, try to jstack the process that seems to be hanging.

Finally, like I said in another thread, there's a bug in 0.20.6 that
almost prevents disabling (or re-enabling) a table if any region
recently split and the parent wasn't cleaned from .META. yet; that is
fixed in 0.90.1.

J-D

On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <na...@gmail.com> wrote:
> I think you are right, maybe in the long run I need to re-architect my
> system so that it doesn't need to create new and delete old tables all
> the time. In the short term I am having a really hard time with the
> disabling function, I ran a disable command on a very small table
> (probably dozen of MBs in size) and are no client using the cluster at
> all, and that took about 1 hour to complete! The weird thing is on the
> web UI only the region server carrying the META table has non-zero
> requests, all other RS have 0 requests the entire time. I would think
> they should get some request to flush the memstore at least. I *am*
> using the same RS nodes for some map reduce job at the time and top
> shows the memory usage is almost full on the META region. Would you
> have some idea of what I should investigate?
> Thanks so much.

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
I think you are right; maybe in the long run I need to re-architect my
system so that it doesn't need to create new and delete old tables all
the time. In the short term I am having a really hard time with the
disable operation: I ran a disable command on a very small table
(probably dozens of MBs in size) with no clients using the cluster at
all, and it took about 1 hour to complete! The weird thing is that on
the web UI only the region server carrying the .META. table shows
non-zero requests; all other RS show 0 requests the entire time. I
would think they should get at least some requests to flush their
memstores. I *am* using the same RS nodes for some map reduce jobs at
the time, and top shows that memory usage is almost full on the node
hosting the .META. region. Would you have some idea of what I should
investigate?
Thanks so much.
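
One thing I'll try next, based on the earlier suggestion in this
thread: trigger a flush of the whole table up front and only disable
once the memstores have had time to drain. A rough sketch (the table
name and the wait are placeholders; I'm assuming HBaseAdmin exposes
flush() the same way the shell's flush command does in 0.20, and that
the flush request is asynchronous):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class FlushThenDisable {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);

    String table = "myTable";  // placeholder

    // Ask the region servers to flush this table's memstores now.
    // The request returns quickly, so give the flushes time to
    // finish (tailing the RS logs is the only reliable signal here).
    admin.flush(table);
    Thread.sleep(5 * 60 * 1000L);  // crude wait; tune for the cluster

    // By now the memstores should be small, so the per-region flush
    // that disabling triggers has much less work left to do.
    admin.disableTable(table);
    admin.deleteTable(table);
  }
}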

On Thu, Feb 24, 2011 at 3:24 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> I haven't tried, but it seems incredibly hacky and bound to generate
> more problems than it solves. Instead you could consider using
> different table names.
>
> J-D
>
> On Thu, Feb 24, 2011 at 3:21 PM, Nanheng Wu <na...@gmail.com> wrote:
>> What would happen if I try to remove the region files from hdfs and
>> then remove the regions from the meta table? Is hbase gonna get messed
>> up when it later needs to flush those regions out of the memstore?
>> Thanks!
>>
>> On Thursday, February 24, 2011, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> Exactly.
>>>
>>> J-D
>>>
>>> On Thu, Feb 24, 2011 at 2:45 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Sorry for trying to bring this topic back again guys, so currently in
>>>> 0.20.6 is there's no way to drop a table without large amount of
>>>> flushing?
>>>>
>>>> On Tue, Feb 22, 2011 at 3:04 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>>> Moving this discussion to jira: https://issues.apache.org/jira/browse/HBASE-3557
>>>>>
>>>>> Good stuff guys.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Feb 21, 2011 at 8:25 AM, Hari Sreekumar
>>>>> <hs...@clickable.com> wrote:
>>>>>> I agree, and I have missed this a lot too. We should have a way to directly
>>>>>> drop the table, it would help save a lot of time while developing /testing.
>>>>>> Maybe also have a config entry to enable this behavior, like we have
>>>>>> webinterface.private.actions in hadoop. It can be enabled on the dev cluster
>>>>>> and disabled on production etc.
>>>>>>
>>>>>> Hari
>>>>>>
>>>>>> On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:
>>>>>>
>>>>>>> Could be having some "force" flag specified 3 times and asked for
>>>>>>> confirmation as well, but I like this feature. Whenever I talk to
>>>>>>> people who disable and get stuck it was to prepare a subsequent drop
>>>>>>> table call. So this sounds really useful given enough safety latches
>>>>>>> in place.
>>>>>>>
>>>>>>> Lars
>>>>>>>
>>>>>>> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
>>>>>>> wrote:
>>>>>>> > Actually I never thought of having a special case for that... and I
>>>>>>> > don't see any jira about it. Would you mind opening a new one for
>>>>>>> > that, I think it's a good idea for those times when you're developing
>>>>>>> > something and you want to iterate fast.
>>>>>>> >
>>>>>>> > On the other hand, it's a pretty destructive feature so some people
>>>>>>> > might disagree with having it in the codebase :)
>>>>>>> >
>>>>>>> > J-D
>>>>>>> >
>>>>>>> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>>> >> Actually I wanted to disable the table so I can drop it. It would be
>>>>>>> >> nice to be able to disable the table without flushing memstore. It's
>>>>>>> >> not possible in 0.20.6 is it?
>>>>>>> >>
>>>>>>> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>> >>> To disable a region, it's memstore must first be flushed. That's done
>>>>>>> >>> sequentially for all regions that must be disabled inside a table.
>>>>>>> >>> There's not really a way around it unless you don't need that data.
>>>>>>> >>>
>>>>>>> >>> If you're planning to disable a table but at the same time need it to
>>>>>>> >>> be unavailable for the shortest time possible, I recommend calling a
>>>>>>> >>> flush on the table from the shell first and then some time later doing
>>>>>>> >>> the disable. How much later you ask? Well there's currently no easy
>>>>>>> >>> way to tell, I usually just tail any region server log file until I
>>>>>>> >>> see they're done.
>>>>>>> >>>
>>>>>>> >>> J-D
>>>>>>> >>>
>>>>>>> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
>>>>>>> wrote:
>>>>>>> >>>> From time to time I run into issues where disabling a table pretty
>>>>>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I haven't tried, but it seems incredibly hacky and bound to generate
more problems than it solves. Instead you could consider using
different table names.
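
For example, suffix each day's table with its date and have readers
pick the newest one. A minimal sketch (the naming scheme is only
illustrative, and I'm assuming the usual listTables() and
getNameAsString() calls):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class LatestVersion {
  // Returns the newest table whose name starts with the given prefix,
  // relying on a sortable date suffix such as "myDataset_20110301".
  public static String latest(HBaseAdmin admin, String prefix) throws Exception {
    String newest = null;
    for (HTableDescriptor d : admin.listTables()) {
      String name = d.getNameAsString();
      if (name.startsWith(prefix) && (newest == null || name.compareTo(newest) > 0)) {
        newest = name;
      }
    }
    return newest;
  }

  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    System.out.println("read from: " + latest(admin, "myDataset_"));
  }
}

Old versions can then be dropped lazily, so a slow disable never
blocks loading the new data.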

J-D

On Thu, Feb 24, 2011 at 3:21 PM, Nanheng Wu <na...@gmail.com> wrote:
> What would happen if I try to remove the region files from hdfs and
> then remove the regions from the meta table? Is hbase gonna get messed
> up when it later needs to flush those regions out of the memstore?
> Thanks!
>
> On Thursday, February 24, 2011, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> Exactly.
>>
>> J-D
>>
>> On Thu, Feb 24, 2011 at 2:45 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> Sorry for trying to bring this topic back again guys, so currently in
>>> 0.20.6 is there's no way to drop a table without large amount of
>>> flushing?
>>>
>>> On Tue, Feb 22, 2011 at 3:04 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> Moving this discussion to jira: https://issues.apache.org/jira/browse/HBASE-3557
>>>>
>>>> Good stuff guys.
>>>>
>>>> J-D
>>>>
>>>> On Mon, Feb 21, 2011 at 8:25 AM, Hari Sreekumar
>>>> <hs...@clickable.com> wrote:
>>>>> I agree, and I have missed this a lot too. We should have a way to directly
>>>>> drop the table, it would help save a lot of time while developing /testing.
>>>>> Maybe also have a config entry to enable this behavior, like we have
>>>>> webinterface.private.actions in hadoop. It can be enabled on the dev cluster
>>>>> and disabled on production etc.
>>>>>
>>>>> Hari
>>>>>
>>>>> On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:
>>>>>
>>>>>> Could be having some "force" flag specified 3 times and asked for
>>>>>> confirmation as well, but I like this feature. Whenever I talk to
>>>>>> people who disable and get stuck it was to prepare a subsequent drop
>>>>>> table call. So this sounds really useful given enough safety latches
>>>>>> in place.
>>>>>>
>>>>>> Lars
>>>>>>
>>>>>> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
>>>>>> wrote:
>>>>>> > Actually I never thought of having a special case for that... and I
>>>>>> > don't see any jira about it. Would you mind opening a new one for
>>>>>> > that, I think it's a good idea for those times when you're developing
>>>>>> > something and you want to iterate fast.
>>>>>> >
>>>>>> > On the other hand, it's a pretty destructive feature so some people
>>>>>> > might disagree with having it in the codebase :)
>>>>>> >
>>>>>> > J-D
>>>>>> >
>>>>>> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>>> >> Actually I wanted to disable the table so I can drop it. It would be
>>>>>> >> nice to be able to disable the table without flushing memstore. It's
>>>>>> >> not possible in 0.20.6 is it?
>>>>>> >>
>>>>>> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
>>>>>> jdcryans@apache.org> wrote:
>>>>>> >>> To disable a region, it's memstore must first be flushed. That's done
>>>>>> >>> sequentially for all regions that must be disabled inside a table.
>>>>>> >>> There's not really a way around it unless you don't need that data.
>>>>>> >>>
>>>>>> >>> If you're planning to disable a table but at the same time need it to
>>>>>> >>> be unavailable for the shortest time possible, I recommend calling a
>>>>>> >>> flush on the table from the shell first and then some time later doing
>>>>>> >>> the disable. How much later you ask? Well there's currently no easy
>>>>>> >>> way to tell, I usually just tail any region server log file until I
>>>>>> >>> see they're done.
>>>>>> >>>
>>>>>> >>> J-D
>>>>>> >>>
>>>>>> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
>>>>>> wrote:
>>>>>> >>>> From time to time I run into issues where disabling a table pretty
>>>>>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
What would happen if I tried to remove the region files from HDFS
and then remove the regions from the .META. table? Would HBase get
messed up when it later needs to flush those regions out of the
memstore? Thanks!

On Thursday, February 24, 2011, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Exactly.
>
> J-D
>
> On Thu, Feb 24, 2011 at 2:45 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Sorry for trying to bring this topic back again guys, so currently in
>> 0.20.6 is there's no way to drop a table without large amount of
>> flushing?
>>
>> On Tue, Feb 22, 2011 at 3:04 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> Moving this discussion to jira: https://issues.apache.org/jira/browse/HBASE-3557
>>>
>>> Good stuff guys.
>>>
>>> J-D
>>>
>>> On Mon, Feb 21, 2011 at 8:25 AM, Hari Sreekumar
>>> <hs...@clickable.com> wrote:
>>>> I agree, and I have missed this a lot too. We should have a way to directly
>>>> drop the table, it would help save a lot of time while developing /testing.
>>>> Maybe also have a config entry to enable this behavior, like we have
>>>> webinterface.private.actions in hadoop. It can be enabled on the dev cluster
>>>> and disabled on production etc.
>>>>
>>>> Hari
>>>>
>>>> On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:
>>>>
>>>>> Could be having some "force" flag specified 3 times and asked for
>>>>> confirmation as well, but I like this feature. Whenever I talk to
>>>>> people who disable and get stuck it was to prepare a subsequent drop
>>>>> table call. So this sounds really useful given enough safety latches
>>>>> in place.
>>>>>
>>>>> Lars
>>>>>
>>>>> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
>>>>> wrote:
>>>>> > Actually I never thought of having a special case for that... and I
>>>>> > don't see any jira about it. Would you mind opening a new one for
>>>>> > that, I think it's a good idea for those times when you're developing
>>>>> > something and you want to iterate fast.
>>>>> >
>>>>> > On the other hand, it's a pretty destructive feature so some people
>>>>> > might disagree with having it in the codebase :)
>>>>> >
>>>>> > J-D
>>>>> >
>>>>> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>>> >> Actually I wanted to disable the table so I can drop it. It would be
>>>>> >> nice to be able to disable the table without flushing memstore. It's
>>>>> >> not possible in 0.20.6 is it?
>>>>> >>
>>>>> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>> >>> To disable a region, it's memstore must first be flushed. That's done
>>>>> >>> sequentially for all regions that must be disabled inside a table.
>>>>> >>> There's not really a way around it unless you don't need that data.
>>>>> >>>
>>>>> >>> If you're planning to disable a table but at the same time need it to
>>>>> >>> be unavailable for the shortest time possible, I recommend calling a
>>>>> >>> flush on the table from the shell first and then some time later doing
>>>>> >>> the disable. How much later you ask? Well there's currently no easy
>>>>> >>> way to tell, I usually just tail any region server log file until I
>>>>> >>> see they're done.
>>>>> >>>
>>>>> >>> J-D
>>>>> >>>
>>>>> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
>>>>> wrote:
>>>>> >>>> From time to time I run into issues where disabling a table pretty
>>>>>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Exactly.

J-D

On Thu, Feb 24, 2011 at 2:45 PM, Nanheng Wu <na...@gmail.com> wrote:
> Sorry for trying to bring this topic back again guys, so currently in
> 0.20.6 is there's no way to drop a table without large amount of
> flushing?
>
> On Tue, Feb 22, 2011 at 3:04 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> Moving this discussion to jira: https://issues.apache.org/jira/browse/HBASE-3557
>>
>> Good stuff guys.
>>
>> J-D
>>
>> On Mon, Feb 21, 2011 at 8:25 AM, Hari Sreekumar
>> <hs...@clickable.com> wrote:
>>> I agree, and I have missed this a lot too. We should have a way to directly
>>> drop the table, it would help save a lot of time while developing /testing.
>>> Maybe also have a config entry to enable this behavior, like we have
>>> webinterface.private.actions in hadoop. It can be enabled on the dev cluster
>>> and disabled on production etc.
>>>
>>> Hari
>>>
>>> On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:
>>>
>>>> Could be having some "force" flag specified 3 times and asked for
>>>> confirmation as well, but I like this feature. Whenever I talk to
>>>> people who disable and get stuck it was to prepare a subsequent drop
>>>> table call. So this sounds really useful given enough safety latches
>>>> in place.
>>>>
>>>> Lars
>>>>
>>>> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
>>>> wrote:
>>>> > Actually I never thought of having a special case for that... and I
>>>> > don't see any jira about it. Would you mind opening a new one for
>>>> > that, I think it's a good idea for those times when you're developing
>>>> > something and you want to iterate fast.
>>>> >
>>>> > On the other hand, it's a pretty destructive feature so some people
>>>> > might disagree with having it in the codebase :)
>>>> >
>>>> > J-D
>>>> >
>>>> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> >> Actually I wanted to disable the table so I can drop it. It would be
>>>> >> nice to be able to disable the table without flushing memstore. It's
>>>> >> not possible in 0.20.6 is it?
>>>> >>
>>>> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
>>>> jdcryans@apache.org> wrote:
>>>> >>> To disable a region, it's memstore must first be flushed. That's done
>>>> >>> sequentially for all regions that must be disabled inside a table.
>>>> >>> There's not really a way around it unless you don't need that data.
>>>> >>>
>>>> >>> If you're planning to disable a table but at the same time need it to
>>>> >>> be unavailable for the shortest time possible, I recommend calling a
>>>> >>> flush on the table from the shell first and then some time later doing
>>>> >>> the disable. How much later you ask? Well there's currently no easy
>>>> >>> way to tell, I usually just tail any region server log file until I
>>>> >>> see they're done.
>>>> >>>
>>>> >>> J-D
>>>> >>>
>>>> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
>>>> wrote:
>>>> >>>> From time to time I run into issues where disabling a table pretty
>>>> >>>> much hangs. I am simply calling the disableTable method fo HBaseAdmin.
>>>> >>>> The table has ~ 500 regions with default region file size. I couldn't
>>>> >>>> tell anything abnormal from the master's log. When I click on the
>>>> >>>> region from Master's web UI I get the "RegionOfflineException". I am
>>>> >>>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
>>>> >>>> node running Master/ZK/NameNode. Can someone help me understand what's
>>>> >>>> wrong? Thanks!
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
Sorry for bringing this topic back again, guys: so currently, in
0.20.6, is there no way to drop a table without a large amount of
flushing?

On Tue, Feb 22, 2011 at 3:04 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Moving this discussion to jira: https://issues.apache.org/jira/browse/HBASE-3557
>
> Good stuff guys.
>
> J-D
>
> On Mon, Feb 21, 2011 at 8:25 AM, Hari Sreekumar
> <hs...@clickable.com> wrote:
>> I agree, and I have missed this a lot too. We should have a way to directly
>> drop the table, it would help save a lot of time while developing /testing.
>> Maybe also have a config entry to enable this behavior, like we have
>> webinterface.private.actions in hadoop. It can be enabled on the dev cluster
>> and disabled on production etc.
>>
>> Hari
>>
>> On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:
>>
>>> Could be having some "force" flag specified 3 times and asked for
>>> confirmation as well, but I like this feature. Whenever I talk to
>>> people who disable and get stuck it was to prepare a subsequent drop
>>> table call. So this sounds really useful given enough safety latches
>>> in place.
>>>
>>> Lars
>>>
>>> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
>>> wrote:
>>> > Actually I never thought of having a special case for that... and I
>>> > don't see any jira about it. Would you mind opening a new one for
>>> > that, I think it's a good idea for those times when you're developing
>>> > something and you want to iterate fast.
>>> >
>>> > On the other hand, it's a pretty destructive feature so some people
>>> > might disagree with having it in the codebase :)
>>> >
>>> > J-D
>>> >
>>> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> >> Actually I wanted to disable the table so I can drop it. It would be
>>> >> nice to be able to disable the table without flushing memstore. It's
>>> >> not possible in 0.20.6 is it?
>>> >>
>>> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
>>> jdcryans@apache.org> wrote:
>>> >>> To disable a region, it's memstore must first be flushed. That's done
>>> >>> sequentially for all regions that must be disabled inside a table.
>>> >>> There's not really a way around it unless you don't need that data.
>>> >>>
>>> >>> If you're planning to disable a table but at the same time need it to
>>> >>> be unavailable for the shortest time possible, I recommend calling a
>>> >>> flush on the table from the shell first and then some time later doing
>>> >>> the disable. How much later you ask? Well there's currently no easy
>>> >>> way to tell, I usually just tail any region server log file until I
>>> >>> see they're done.
>>> >>>
>>> >>> J-D
>>> >>>
>>> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
>>> wrote:
>>> >>>> From time to time I run into issues where disabling a table pretty
>>> >>>> much hangs. I am simply calling the disableTable method fo HBaseAdmin.
>>> >>>> The table has ~ 500 regions with default region file size. I couldn't
>>> >>>> tell anything abnormal from the master's log. When I click on the
>>> >>>> region from Master's web UI I get the "RegionOfflineException". I am
>>> >>>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
>>> >>>> node running Master/ZK/NameNode. Can someone help me understand what's
>>> >>>> wrong? Thanks!
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Moving this discussion to jira: https://issues.apache.org/jira/browse/HBASE-3557

Good stuff guys.

J-D

On Mon, Feb 21, 2011 at 8:25 AM, Hari Sreekumar
<hs...@clickable.com> wrote:
> I agree, and I have missed this a lot too. We should have a way to directly
> drop the table, it would help save a lot of time while developing /testing.
> Maybe also have a config entry to enable this behavior, like we have
> webinterface.private.actions in hadoop. It can be enabled on the dev cluster
> and disabled on production etc.
>
> Hari
>
> On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:
>
>> Could be having some "force" flag specified 3 times and asked for
>> confirmation as well, but I like this feature. Whenever I talk to
>> people who disable and get stuck it was to prepare a subsequent drop
>> table call. So this sounds really useful given enough safety latches
>> in place.
>>
>> Lars
>>
>> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
>> wrote:
>> > Actually I never thought of having a special case for that... and I
>> > don't see any jira about it. Would you mind opening a new one for
>> > that, I think it's a good idea for those times when you're developing
>> > something and you want to iterate fast.
>> >
>> > On the other hand, it's a pretty destructive feature so some people
>> > might disagree with having it in the codebase :)
>> >
>> > J-D
>> >
>> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>> >> Actually I wanted to disable the table so I can drop it. It would be
>> >> nice to be able to disable the table without flushing memstore. It's
>> >> not possible in 0.20.6 is it?
>> >>
>> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
>> jdcryans@apache.org> wrote:
>> >>> To disable a region, it's memstore must first be flushed. That's done
>> >>> sequentially for all regions that must be disabled inside a table.
>> >>> There's not really a way around it unless you don't need that data.
>> >>>
>> >>> If you're planning to disable a table but at the same time need it to
>> >>> be unavailable for the shortest time possible, I recommend calling a
>> >>> flush on the table from the shell first and then some time later doing
>> >>> the disable. How much later you ask? Well there's currently no easy
>> >>> way to tell, I usually just tail any region server log file until I
>> >>> see they're done.
>> >>>
>> >>> J-D
>> >>>
>> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
>> wrote:
>> >>>> From time to time I run into issues where disabling a table pretty
>> >>>> much hangs. I am simply calling the disableTable method fo HBaseAdmin.
>> >>>> The table has ~ 500 regions with default region file size. I couldn't
>> >>>> tell anything abnormal from the master's log. When I click on the
>> >>>> region from Master's web UI I get the "RegionOfflineException". I am
>> >>>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
>> >>>> node running Master/ZK/NameNode. Can someone help me understand what's
>> >>>> wrong? Thanks!
>> >>>>
>> >>>
>> >>
>> >
>>
>

Re: Disabling a table taking very long time

Posted by Hari Sreekumar <hs...@clickable.com>.
I agree, and I have missed this a lot too. We should have a way to directly
drop a table; it would help save a lot of time while developing/testing.
Maybe also have a config entry to enable this behavior, like we have
webinterface.private.actions in Hadoop. It could be enabled on the dev
cluster and disabled in production, etc.

Hari

On Mon, Feb 21, 2011 at 12:41 PM, Lars George <la...@gmail.com> wrote:

> Could be having some "force" flag specified 3 times and asked for
> confirmation as well, but I like this feature. Whenever I talk to
> people who disable and get stuck it was to prepare a subsequent drop
> table call. So this sounds really useful given enough safety latches
> in place.
>
> Lars
>
> On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org>
> wrote:
> > Actually I never thought of having a special case for that... and I
> > don't see any jira about it. Would you mind opening a new one for
> > that, I think it's a good idea for those times when you're developing
> > something and you want to iterate fast.
> >
> > On the other hand, it's a pretty destructive feature so some people
> > might disagree with having it in the codebase :)
> >
> > J-D
> >
> > On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
> >> Actually I wanted to disable the table so I can drop it. It would be
> >> nice to be able to disable the table without flushing memstore. It's
> >> not possible in 0.20.6 is it?
> >>
> >> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <
> jdcryans@apache.org> wrote:
> >>> To disable a region, it's memstore must first be flushed. That's done
> >>> sequentially for all regions that must be disabled inside a table.
> >>> There's not really a way around it unless you don't need that data.
> >>>
> >>> If you're planning to disable a table but at the same time need it to
> >>> be unavailable for the shortest time possible, I recommend calling a
> >>> flush on the table from the shell first and then some time later doing
> >>> the disable. How much later you ask? Well there's currently no easy
> >>> way to tell, I usually just tail any region server log file until I
> >>> see they're done.
> >>>
> >>> J-D
> >>>
> >>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com>
> wrote:
> >>>> From time to time I run into issues where disabling a table pretty
> >>>> much hangs. I am simply calling the disableTable method fo HBaseAdmin.
> >>>> The table has ~ 500 regions with default region file size. I couldn't
> >>>> tell anything abnormal from the master's log. When I click on the
> >>>> region from Master's web UI I get the "RegionOfflineException". I am
> >>>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
> >>>> node running Master/ZK/NameNode. Can someone help me understand what's
> >>>> wrong? Thanks!
> >>>>
> >>>
> >>
> >
>

Re: Disabling a table taking very long time

Posted by Lars George <la...@gmail.com>.
It could require some "force" flag to be specified 3 times and ask for
confirmation as well, but I like this feature. Whenever I talk to
people who disable a table and get stuck, it is usually to prepare a
subsequent drop table call. So this sounds really useful, given
enough safety latches in place.

Lars

On Thu, Feb 17, 2011 at 1:50 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Actually, I never thought of having a special case for that... and I
> don't see any JIRA about it. Would you mind opening a new one for
> that? I think it's a good idea for those times when you're developing
> something and you want to iterate fast.
>
> On the other hand, it's a pretty destructive feature, so some people
> might disagree with having it in the codebase :)
>
> J-D
>
> On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Actually, I wanted to disable the table so I could drop it. It would be
>> nice to be able to disable a table without flushing the memstore. That's
>> not possible in 0.20.6, is it?
>>
>> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> To disable a region, its memstore must first be flushed. That's done
>>> sequentially for all regions that must be disabled inside a table.
>>> There's not really a way around it unless you don't need that data.
>>>
>>> If you're planning to disable a table but at the same time need it to
>>> be unavailable for the shortest time possible, I recommend calling a
>>> flush on the table from the shell first and then doing the disable
>>> some time later. How much later, you ask? There's currently no easy
>>> way to tell; I usually just tail the region server log files until I
>>> see the flushes are done.
>>>
>>> J-D
>>>
>>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com> wrote:
>>>> From time to time I run into issues where disabling a table pretty
>>>> much hangs. I am simply calling the disableTable method of HBaseAdmin.
>>>> The table has ~ 500 regions with default region file size. I couldn't
>>>> tell anything abnormal from the master's log. When I click on the
>>>> region from Master's web UI I get the "RegionOfflineException". I am
>>>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
>>>> node running Master/ZK/NameNode. Can someone help me understand what's
>>>> wrong? Thanks!
>>>>
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Actually, I never thought of having a special case for that... and I
don't see any JIRA about it. Would you mind opening a new one for
that? I think it's a good idea for those times when you're developing
something and you want to iterate fast.

On the other hand, it's a pretty destructive feature, so some people
might disagree with having it in the codebase :)

J-D

On Wed, Feb 16, 2011 at 4:26 PM, Nanheng Wu <na...@gmail.com> wrote:
> Actually, I wanted to disable the table so I could drop it. It would be
> nice to be able to disable a table without flushing the memstore. That's
> not possible in 0.20.6, is it?
>
> On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> To disable a region, its memstore must first be flushed. That's done
>> sequentially for all regions that must be disabled inside a table.
>> There's not really a way around it unless you don't need that data.
>>
>> If you're planning to disable a table but at the same time need it to
>> be unavailable for the shortest time possible, I recommend calling a
>> flush on the table from the shell first and then doing the disable
>> some time later. How much later, you ask? There's currently no easy
>> way to tell; I usually just tail the region server log files until I
>> see the flushes are done.
>>
>> J-D
>>
>> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com> wrote:
>>> From time to time I run into issues where disabling a table pretty
>>> much hangs. I am simply calling the disableTable method of HBaseAdmin.
>>> The table has ~ 500 regions with default region file size. I couldn't
>>> tell anything abnormal from the master's log. When I click on the
>>> region from Master's web UI I get the "RegionOfflineException". I am
>>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
>>> node running Master/ZK/NameNode. Can someone help me understand what's
>>> wrong? Thanks!
>>>
>>
>

Re: Disabling a table taking very long time

Posted by Nanheng Wu <na...@gmail.com>.
Actually, I wanted to disable the table so I could drop it. It would be
nice to be able to disable a table without flushing the memstore. That's
not possible in 0.20.6, is it?
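
For context, the only path 0.20.6 offers is the plain disable-then-drop
sequence, flushes included. A minimal sketch with the 0.20-era Java client
(the class and table names are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DisableAndDrop {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);
    String table = "my_table";  // placeholder table name

    // disableTable waits while each region's memstore is flushed and the
    // region is taken offline; with ~500 regions this is the slow part.
    admin.disableTable(table);

    // Once the table is fully offline it can be deleted.
    admin.deleteTable(table);
  }
}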

On Wed, Feb 16, 2011 at 2:30 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> To disable a region, its memstore must first be flushed. That's done
> sequentially for all regions that must be disabled inside a table.
> There's not really a way around it unless you don't need that data.
>
> If you're planning to disable a table but at the same time need it to
> be unavailable for the shortest time possible, I recommend calling a
> flush on the table from the shell first and then doing the disable
> some time later. How much later, you ask? There's currently no easy
> way to tell; I usually just tail the region server log files until I
> see the flushes are done.
>
> J-D
>
> On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com> wrote:
>> From time to time I run into issues where disabling a table pretty
>> much hangs. I am simply calling the disableTable method of HBaseAdmin.
>> The table has ~ 500 regions with default region file size. I couldn't
>> tell anything abnormal from the master's log. When I click on the
>> region from Master's web UI I get the "RegionOfflineException". I am
>> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
>> node running Master/ZK/NameNode. Can someone help me understand what's
>> wrong? Thanks!
>>
>

Re: Disabling a table taking very long time

Posted by Jean-Daniel Cryans <jd...@apache.org>.
To disable a region, its memstore must first be flushed. That's done
sequentially for all regions that must be disabled inside a table.
There's not really a way around it unless you don't need that data.

If you're planning to disable a table but at the same time need it to
be unavailable for the shortest time possible, I recommend calling a
flush on the table from the shell first and then doing the disable
some time later. How much later, you ask? There's currently no easy
way to tell; I usually just tail the region server log files until I
see the flushes are done.

J-D
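
A minimal sketch of the flush-first approach above, done from the Java
client instead of the shell; the class name, table name, and the fixed wait
are placeholders, since there is no clean way to know when the flushes
finish:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class FlushThenDisable {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);
    String table = "my_table";  // placeholder table name

    // Ask the region servers to flush the table's memstores up front,
    // the client-side equivalent of running "flush 'my_table'" in the shell.
    admin.flush(table);

    // The flush runs asynchronously; sleeping (or watching the region
    // server logs) is a crude stand-in for "some time later".
    Thread.sleep(60 * 1000L);

    // With the memstores already flushed, the disable has much less
    // work left to do per region.
    admin.disableTable(table);
  }
}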

On Wed, Feb 16, 2011 at 2:21 PM, Nanheng Wu <na...@gmail.com> wrote:
> From time to time I run into issues where disabling a table pretty
> much hangs. I am simply calling the disableTable method of HBaseAdmin.
> The table has ~ 500 regions with default region file size. I couldn't
> tell anything abnormal from the master's log. When I click on the
> region from Master's web UI I get the "RegionOfflineException". I am
> using HBase 0.20.6 + Hadoop 0.20.2. My cluster has ~10 nodes with one
> node running Master/ZK/NameNode. Can someone help me understand what's
> wrong? Thanks!
>