You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Randy Fox <rf...@connexity.com> on 2015/10/15 19:32:33 UTC

Slow region moves

Hi,

We just upgraded from 0.94 to 1.0.0 and have noticed that region moves are super slow (order of minutes) whereas previously they where in the seconds range.  After looking at the code, I think the time is spent waiting for the blocks to be evicted from block cache.

I wanted to verify that this theory is correct and see if there is anything that can be done to speed up the moves.

This is particular painful as we are trying to get our configs tuned to the new SW and need to do rolling restarts which is taking almost 24 hours on our cluster.  We also do our own manual rebalancing of regions across RS’s and that task is also now painful.


Thanks,

Randy

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

Thanks a ton for the quick turn around.  After one more slow rolling restart and we will be quite happy.

Cheers,

Randy




On 10/22/15, 10:38 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:

>Good to know :)
>
>-Vlad
>
>On Thu, Oct 22, 2015 at 9:40 AM, Randy Fox <rf...@connexity.com> wrote:
>
>> Hi Vlad,
>>
>> So far patch seems to work perfectly.
>>
>> -randy
>>
>>
>>
>>
>> On 10/21/15, 12:52 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>>
>> >Randy,
>> >
>> >You can try patch I just submitted. It is for master but I verified it on
>> >1.0 branch as well.
>> >
>> >-Vlad
>> >
>> >On Wed, Oct 21, 2015 at 11:40 AM, Randy Fox <rf...@connexity.com> wrote:
>> >
>> >> https://issues.apache.org/jira/browse/HBASE-14663
>> >>
>> >> -r
>> >>
>> >>
>> >>
>> >> On 10/21/15, 10:35 AM, "Vladimir Rodionov" <vl...@gmail.com>
>> wrote:
>> >>
>> >> >You are right, Randy
>> >> >
>> >> >This is the bug. Will you open JIRA?
>> >> >
>> >> >-Vlad
>> >> >
>> >> >On Wed, Oct 21, 2015 at 9:35 AM, Randy Fox <rf...@connexity.com> wrote:
>> >> >
>> >> >> Maybe I am looking in the wrong place but Hstore::close() has the
>> >> >> evictOnClose parameter hard coded to true:
>> >> >>
>> >> >> // close each store file in parallel
>> >> >> CompletionService<Void> completionService =
>> >> >>   new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
>> >> >> for (final StoreFile f : result) {
>> >> >>   completionService.submit(new Callable<Void>() {
>> >> >>     @Override
>> >> >>     public Void call() throws IOException {
>> >> >>       f.closeReader(true);
>> >> >>       return null;
>> >> >>     }
>> >> >>   });
>> >> >> }
>> >> >>
>> >> >>
>> >> >> Where does that setting come into play?
>> >> >>
>> >> >> -r
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >> >I wonder why disabling cache eviction on close does not work in a
>> case
>> >> of
>> >> >> a
>> >> >> >bucket cache? I checked the code and did not find
>> >> >> >anything suspicious.  It has to work.
>> >> >> >
>> >> >> >On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
>> >> >> >ramkrishna.s.vasudevan@gmail.com> wrote:
>> >> >> >
>> >> >> >> Seems that the BucketAllocator#freeBlock() is synchronized and
>> hence
>> >> all
>> >> >> >> the bulk close that it tries to do will be blocked in the
>> >> synchronized
>> >> >> >> block.  May be something like the IdLock has to be tried here?
>> >> >> >>
>> >> >> >> Regards
>> >> >> >> Ram
>> >> >> >>
>> >> >> >> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
>> >> >> >> ramkrishna.s.vasudevan@gmail.com> wrote:
>> >> >> >>
>> >> >> >> > I think the forceful clearing of the blocks from the bucket
>> cache
>> >> is
>> >> >> >> > hurting in this case.  I think it is worth opening a JIRA for
>> this
>> >> and
>> >> >> >> work
>> >> >> >> > on a fix.
>> >> >> >> >
>> >> >> >> > Regards
>> >> >> >> > Ram
>> >> >> >> >
>> >> >> >> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rfox@connexity.com
>> >
>> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> Hi Vlad,
>> >> >> >> >>
>> >> >> >> >> I tried it on a table and on a RegionServer basis and it
>> appears
>> >> to
>> >> >> have
>> >> >> >> >> no affect.
>> >> >> >> >> Are we sure it is supported for bucket cache?  From my charts
>> the
>> >> >> bucket
>> >> >> >> >> cache is getting cleared at the same time as the region moves
>> >> >> occurred.
>> >> >> >> >> The regions slow to move are the ones with bucket cache.
>> >> >> >> >>
>> >> >> >> >> I took a table with 102 regions and blockcache true and turned
>> off
>> >> >> block
>> >> >> >> >> cache via alter while the table is enabled - it took 19
>> minutes.
>> >> To
>> >> >> >> turn
>> >> >> >> >> block cache back on took 4.3 seconds.
>> >> >> >> >>
>> >> >> >> >> Let me know if there is anything else to try.  This issue is
>> >> really
>> >> >> >> >> hurting our day to day ops.
>> >> >> >> >>
>> >> >> >> >> Thanks,
>> >> >> >> >>
>> >> >> >> >> Randy
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <
>> vladrodionov@gmail.com
>> >> >
>> >> >> >> wrote:
>> >> >> >> >>
>> >> >> >> >> >Hey, Randy
>> >> >> >> >> >
>> >> >> >> >> >You can verify your hypothesis by setting
>> >> >> hbase.rs.evictblocksonclose
>> >> >> >> to
>> >> >> >> >> >false for your tables.
>> >> >> >> >> >
>> >> >> >> >> >-Vlad
>> >> >> >> >> >
>> >> >> >> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <
>> rfox@connexity.com>
>> >> >> wrote:
>> >> >> >> >> >
>> >> >> >> >> >> Caveat - we are trying to tune the BucketCache (probably a
>> new
>> >> >> thread
>> >> >> >> >> - as
>> >> >> >> >> >> we are not sure we are getting the most out of it)
>> >> >> >> >> >> 72G off heap
>> >> >> >> >> >>
>> >> >> >> >> >> <property>
>> >> >> >> >> >>    <name>hfile.block.cache.size</name>
>> >> >> >> >> >>    <value>0.58</value>
>> >> >> >> >> >> </property>
>> >> >> >> >> >>
>> >> >> >> >> >> <property>
>> >> >> >> >> >>    <name>hbase.bucketcache.ioengine</name>
>> >> >> >> >> >>    <value>offheap</value>
>> >> >> >> >> >> </property>
>> >> >> >> >> >>
>> >> >> >> >> >> <property>
>> >> >> >> >> >>    <name>hbase.bucketcache.size</name>
>> >> >> >> >> >>    <value>72800</value>
>> >> >> >> >> >> </property>
>> >> >> >> >> >>
>> >> >> >> >> >> <property>
>> >> >> >> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
>> >> >> >> >> >>    <value>9216,17408,33792,66560</value>
>> >> >> >> >> >> </property>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com>
>> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> >I am a bit curious.
>> >> >> >> >> >> >0.94 doesn't have BucketCache.
>> >> >> >> >> >> >
>> >> >> >> >> >> >Can you share BucketCache related config parameters in your
>> >> >> cluster
>> >> >> >> ?
>> >> >> >> >> >> >
>> >> >> >> >> >> >Cheers
>> >> >> >> >> >> >
>> >> >> >> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <
>> >> rfox@connexity.com>
>> >> >> >> >> wrote:
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> "StoreFileCloserThread-L-1" prio=10
>> tid=0x00000000027ec800
>> >> >> >> >> nid=0xad84
>> >> >> >> >> >> >> runnable [0x00007fbcc0c65000]
>> >> >> >> >> >> >>    java.lang.Thread.State: RUNNABLE
>> >> >> >> >> >> >>         at
>> java.util.LinkedList.indexOf(LinkedList.java:602)
>> >> >> >> >> >> >>         at
>> >> java.util.LinkedList.contains(LinkedList.java:315)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>> >> >> >> >> >> >>         - locked <0x000000041b0887a8> (a
>> >> >> >> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>> >> >> >> >> >> >>         - locked <0x00000004944ff2d8> (a
>> >> >> >> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>> >> >> >> >> >> >>         at
>> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >> >> >> >>         at
>> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> >> >> >> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on
>> >> condition
>> >> >> >> >> >> >> [0x00007fbcc5dcc000]
>> >> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>> >> >> >> >> >> >>         at
>> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >> >> >> >>         at
>> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10
>> >> tid=0x00007fcec0142000
>> >> >> >> >> nid=0x3056
>> >> >> >> >> >> >> waiting on condition [0x00007fbcc2d87000]
>> >> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>> >> >> >> >> >> >>         - locked <0x000000042230fa68> (a
>> java.lang.Object)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >> >> >>         at
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I attached the whole thing as well.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> -r
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com>
>> >> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >Can you give a bit more detail on why block eviction was
>> >> cause
>> >> >> >> for
>> >> >> >> >> the
>> >> >> >> >> >> >> slow region movement?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Did you happen to take stack traces ?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Thanks
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <
>> >> rfox@connexity.com
>> >> >> >
>> >> >> >> >> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Hi,
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed
>> that
>> >> >> >> region
>> >> >> >> >> >> moves
>> >> >> >> >> >> >> are super slow (order of minutes) whereas previously they
>> >> >> where in
>> >> >> >> >> the
>> >> >> >> >> >> >> seconds range.  After looking at the code, I think the
>> time
>> >> is
>> >> >> >> spent
>> >> >> >> >> >> >> waiting for the blocks to be evicted from block cache.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I wanted to verify that this theory is correct and
>> see if
>> >> >> there
>> >> >> >> >> is
>> >> >> >> >> >> >> anything that can be done to speed up the moves.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> This is particular painful as we are trying to get our
>> >> >> configs
>> >> >> >> >> tuned
>> >> >> >> >> >> to
>> >> >> >> >> >> >> the new SW and need to do rolling restarts which is
>> taking
>> >> >> almost
>> >> >> >> 24
>> >> >> >> >> >> hours
>> >> >> >> >> >> >> on our cluster.  We also do our own manual rebalancing of
>> >> >> regions
>> >> >> >> >> across
>> >> >> >> >> >> >> RS’s and that task is also now painful.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Thanks,
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Randy
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>>

Re: Slow region moves

Posted by Vladimir Rodionov <vl...@gmail.com>.

Good to know :)

-Vlad

On Thu, Oct 22, 2015 at 9:40 AM, Randy Fox <rf...@connexity.com> wrote:

> Hi Vlad,
>
> So far patch seems to work perfectly.
>
> -randy
>
>
>
>
> On 10/21/15, 12:52 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>
> >Randy,
> >
> >You can try patch I just submitted. It is for master but I verified it on
> >1.0 branch as well.
> >
> >-Vlad
> >
> >On Wed, Oct 21, 2015 at 11:40 AM, Randy Fox <rf...@connexity.com> wrote:
> >
> >> https://issues.apache.org/jira/browse/HBASE-14663
> >>
> >> -r
> >>
> >>
> >>
> >> On 10/21/15, 10:35 AM, "Vladimir Rodionov" <vl...@gmail.com>
> wrote:
> >>
> >> >You are right, Randy
> >> >
> >> >This is the bug. Will you open JIRA?
> >> >
> >> >-Vlad
> >> >
> >> >On Wed, Oct 21, 2015 at 9:35 AM, Randy Fox <rf...@connexity.com> wrote:
> >> >
> >> >> Maybe I am looking in the wrong place but Hstore::close() has the
> >> >> evictOnClose parameter hard coded to true:
> >> >>
> >> >> // close each store file in parallel
> >> >> CompletionService<Void> completionService =
> >> >>   new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
> >> >> for (final StoreFile f : result) {
> >> >>   completionService.submit(new Callable<Void>() {
> >> >>     @Override
> >> >>     public Void call() throws IOException {
> >> >>       f.closeReader(true);
> >> >>       return null;
> >> >>     }
> >> >>   });
> >> >> }
> >> >>
> >> >>
> >> >> Where does that setting come into play?
> >> >>
> >> >> -r
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com>
> >> wrote:
> >> >>
> >> >> >I wonder why disabling cache eviction on close does not work in a
> case
> >> of
> >> >> a
> >> >> >bucket cache? I checked the code and did not find
> >> >> >anything suspicious.  It has to work.
> >> >> >
> >> >> >On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
> >> >> >ramkrishna.s.vasudevan@gmail.com> wrote:
> >> >> >
> >> >> >> Seems that the BucketAllocator#freeBlock() is synchronized and
> hence
> >> all
> >> >> >> the bulk close that it tries to do will be blocked in the
> >> synchronized
> >> >> >> block.  May be something like the IdLock has to be tried here?
> >> >> >>
> >> >> >> Regards
> >> >> >> Ram
> >> >> >>
> >> >> >> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
> >> >> >> ramkrishna.s.vasudevan@gmail.com> wrote:
> >> >> >>
> >> >> >> > I think the forceful clearing of the blocks from the bucket
> cache
> >> is
> >> >> >> > hurting in this case.  I think it is worth opening a JIRA for
> this
> >> and
> >> >> >> work
> >> >> >> > on a fix.
> >> >> >> >
> >> >> >> > Regards
> >> >> >> > Ram
> >> >> >> >
> >> >> >> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rfox@connexity.com
> >
> >> >> wrote:
> >> >> >> >
> >> >> >> >> Hi Vlad,
> >> >> >> >>
> >> >> >> >> I tried it on a table and on a RegionServer basis and it
> appears
> >> to
> >> >> have
> >> >> >> >> no affect.
> >> >> >> >> Are we sure it is supported for bucket cache?  From my charts
> the
> >> >> bucket
> >> >> >> >> cache is getting cleared at the same time as the region moves
> >> >> occurred.
> >> >> >> >> The regions slow to move are the ones with bucket cache.
> >> >> >> >>
> >> >> >> >> I took a table with 102 regions and blockcache true and turned
> off
> >> >> block
> >> >> >> >> cache via alter while the table is enabled - it took 19
> minutes.
> >> To
> >> >> >> turn
> >> >> >> >> block cache back on took 4.3 seconds.
> >> >> >> >>
> >> >> >> >> Let me know if there is anything else to try.  This issue is
> >> really
> >> >> >> >> hurting our day to day ops.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >> Randy
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <
> vladrodionov@gmail.com
> >> >
> >> >> >> wrote:
> >> >> >> >>
> >> >> >> >> >Hey, Randy
> >> >> >> >> >
> >> >> >> >> >You can verify your hypothesis by setting
> >> >> hbase.rs.evictblocksonclose
> >> >> >> to
> >> >> >> >> >false for your tables.
> >> >> >> >> >
> >> >> >> >> >-Vlad
> >> >> >> >> >
> >> >> >> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <
> rfox@connexity.com>
> >> >> wrote:
> >> >> >> >> >
> >> >> >> >> >> Caveat - we are trying to tune the BucketCache (probably a
> new
> >> >> thread
> >> >> >> >> - as
> >> >> >> >> >> we are not sure we are getting the most out of it)
> >> >> >> >> >> 72G off heap
> >> >> >> >> >>
> >> >> >> >> >> <property>
> >> >> >> >> >>    <name>hfile.block.cache.size</name>
> >> >> >> >> >>    <value>0.58</value>
> >> >> >> >> >> </property>
> >> >> >> >> >>
> >> >> >> >> >> <property>
> >> >> >> >> >>    <name>hbase.bucketcache.ioengine</name>
> >> >> >> >> >>    <value>offheap</value>
> >> >> >> >> >> </property>
> >> >> >> >> >>
> >> >> >> >> >> <property>
> >> >> >> >> >>    <name>hbase.bucketcache.size</name>
> >> >> >> >> >>    <value>72800</value>
> >> >> >> >> >> </property>
> >> >> >> >> >>
> >> >> >> >> >> <property>
> >> >> >> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
> >> >> >> >> >>    <value>9216,17408,33792,66560</value>
> >> >> >> >> >> </property>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com>
> wrote:
> >> >> >> >> >>
> >> >> >> >> >> >I am a bit curious.
> >> >> >> >> >> >0.94 doesn't have BucketCache.
> >> >> >> >> >> >
> >> >> >> >> >> >Can you share BucketCache related config parameters in your
> >> >> cluster
> >> >> >> ?
> >> >> >> >> >> >
> >> >> >> >> >> >Cheers
> >> >> >> >> >> >
> >> >> >> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <
> >> rfox@connexity.com>
> >> >> >> >> wrote:
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> "StoreFileCloserThread-L-1" prio=10
> tid=0x00000000027ec800
> >> >> >> >> nid=0xad84
> >> >> >> >> >> >> runnable [0x00007fbcc0c65000]
> >> >> >> >> >> >>    java.lang.Thread.State: RUNNABLE
> >> >> >> >> >> >>         at
> java.util.LinkedList.indexOf(LinkedList.java:602)
> >> >> >> >> >> >>         at
> >> java.util.LinkedList.contains(LinkedList.java:315)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
> >> >> >> >> >> >>         - locked <0x000000041b0887a8> (a
> >> >> >> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
> >> >> >> >> >> >>         - locked <0x00000004944ff2d8> (a
> >> >> >> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >>
> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >>
> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
> >> >> >> >> >> >>         at
> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >> >> >> >>         at
> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> >> >> >> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on
> >> condition
> >> >> >> >> >> >> [0x00007fbcc5dcc000]
> >> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
> >> >> >> >> >> >>         at
> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >> >> >> >>         at
> >> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10
> >> tid=0x00007fcec0142000
> >> >> >> >> nid=0x3056
> >> >> >> >> >> >> waiting on condition [0x00007fbcc2d87000]
> >> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
> >> >> >> >> >> >>         - locked <0x000000042230fa68> (a
> java.lang.Object)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >> >>         at
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> I attached the whole thing as well.
> >> >> >> >> >> >>
> >> >> >> >> >> >> -r
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com>
> >> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> >Can you give a bit more detail on why block eviction was
> >> cause
> >> >> >> for
> >> >> >> >> the
> >> >> >> >> >> >> slow region movement?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >Did you happen to take stack traces ?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >Thanks
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <
> >> rfox@connexity.com
> >> >> >
> >> >> >> >> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Hi,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed
> that
> >> >> >> region
> >> >> >> >> >> moves
> >> >> >> >> >> >> are super slow (order of minutes) whereas previously they
> >> >> where in
> >> >> >> >> the
> >> >> >> >> >> >> seconds range.  After looking at the code, I think the
> time
> >> is
> >> >> >> spent
> >> >> >> >> >> >> waiting for the blocks to be evicted from block cache.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I wanted to verify that this theory is correct and
> see if
> >> >> there
> >> >> >> >> is
> >> >> >> >> >> >> anything that can be done to speed up the moves.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> This is particular painful as we are trying to get our
> >> >> configs
> >> >> >> >> tuned
> >> >> >> >> >> to
> >> >> >> >> >> >> the new SW and need to do rolling restarts which is
> taking
> >> >> almost
> >> >> >> 24
> >> >> >> >> >> hours
> >> >> >> >> >> >> on our cluster.  We also do our own manual rebalancing of
> >> >> regions
> >> >> >> >> across
> >> >> >> >> >> >> RS’s and that task is also now painful.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Thanks,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Randy
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >>
> >>
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

Hi Vlad,

So far patch seems to work perfectly.

-randy




On 10/21/15, 12:52 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:

>Randy,
>
>You can try patch I just submitted. It is for master but I verified it on
>1.0 branch as well.
>
>-Vlad
>
>On Wed, Oct 21, 2015 at 11:40 AM, Randy Fox <rf...@connexity.com> wrote:
>
>> https://issues.apache.org/jira/browse/HBASE-14663
>>
>> -r
>>
>>
>>
>> On 10/21/15, 10:35 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>>
>> >You are right, Randy
>> >
>> >This is the bug. Will you open JIRA?
>> >
>> >-Vlad
>> >
>> >On Wed, Oct 21, 2015 at 9:35 AM, Randy Fox <rf...@connexity.com> wrote:
>> >
>> >> Maybe I am looking in the wrong place but Hstore::close() has the
>> >> evictOnClose parameter hard coded to true:
>> >>
>> >> // close each store file in parallel
>> >> CompletionService<Void> completionService =
>> >>   new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
>> >> for (final StoreFile f : result) {
>> >>   completionService.submit(new Callable<Void>() {
>> >>     @Override
>> >>     public Void call() throws IOException {
>> >>       f.closeReader(true);
>> >>       return null;
>> >>     }
>> >>   });
>> >> }
>> >>
>> >>
>> >> Where does that setting come into play?
>> >>
>> >> -r
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com>
>> wrote:
>> >>
>> >> >I wonder why disabling cache eviction on close does not work in a case
>> of
>> >> a
>> >> >bucket cache? I checked the code and did not find
>> >> >anything suspicious.  It has to work.
>> >> >
>> >> >On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
>> >> >ramkrishna.s.vasudevan@gmail.com> wrote:
>> >> >
>> >> >> Seems that the BucketAllocator#freeBlock() is synchronized and hence
>> all
>> >> >> the bulk close that it tries to do will be blocked in the
>> synchronized
>> >> >> block.  May be something like the IdLock has to be tried here?
>> >> >>
>> >> >> Regards
>> >> >> Ram
>> >> >>
>> >> >> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
>> >> >> ramkrishna.s.vasudevan@gmail.com> wrote:
>> >> >>
>> >> >> > I think the forceful clearing of the blocks from the bucket cache
>> is
>> >> >> > hurting in this case.  I think it is worth opening a JIRA for this
>> and
>> >> >> work
>> >> >> > on a fix.
>> >> >> >
>> >> >> > Regards
>> >> >> > Ram
>> >> >> >
>> >> >> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com>
>> >> wrote:
>> >> >> >
>> >> >> >> Hi Vlad,
>> >> >> >>
>> >> >> >> I tried it on a table and on a RegionServer basis and it appears
>> to
>> >> have
>> >> >> >> no affect.
>> >> >> >> Are we sure it is supported for bucket cache?  From my charts the
>> >> bucket
>> >> >> >> cache is getting cleared at the same time as the region moves
>> >> occurred.
>> >> >> >> The regions slow to move are the ones with bucket cache.
>> >> >> >>
>> >> >> >> I took a table with 102 regions and blockcache true and turned off
>> >> block
>> >> >> >> cache via alter while the table is enabled - it took 19 minutes.
>> To
>> >> >> turn
>> >> >> >> block cache back on took 4.3 seconds.
>> >> >> >>
>> >> >> >> Let me know if there is anything else to try.  This issue is
>> really
>> >> >> >> hurting our day to day ops.
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >>
>> >> >> >> Randy
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vladrodionov@gmail.com
>> >
>> >> >> wrote:
>> >> >> >>
>> >> >> >> >Hey, Randy
>> >> >> >> >
>> >> >> >> >You can verify your hypothesis by setting
>> >> hbase.rs.evictblocksonclose
>> >> >> to
>> >> >> >> >false for your tables.
>> >> >> >> >
>> >> >> >> >-Vlad
>> >> >> >> >
>> >> >> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com>
>> >> wrote:
>> >> >> >> >
>> >> >> >> >> Caveat - we are trying to tune the BucketCache (probably a new
>> >> thread
>> >> >> >> - as
>> >> >> >> >> we are not sure we are getting the most out of it)
>> >> >> >> >> 72G off heap
>> >> >> >> >>
>> >> >> >> >> <property>
>> >> >> >> >>    <name>hfile.block.cache.size</name>
>> >> >> >> >>    <value>0.58</value>
>> >> >> >> >> </property>
>> >> >> >> >>
>> >> >> >> >> <property>
>> >> >> >> >>    <name>hbase.bucketcache.ioengine</name>
>> >> >> >> >>    <value>offheap</value>
>> >> >> >> >> </property>
>> >> >> >> >>
>> >> >> >> >> <property>
>> >> >> >> >>    <name>hbase.bucketcache.size</name>
>> >> >> >> >>    <value>72800</value>
>> >> >> >> >> </property>
>> >> >> >> >>
>> >> >> >> >> <property>
>> >> >> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
>> >> >> >> >>    <value>9216,17408,33792,66560</value>
>> >> >> >> >> </property>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> >> >> >> >>
>> >> >> >> >> >I am a bit curious.
>> >> >> >> >> >0.94 doesn't have BucketCache.
>> >> >> >> >> >
>> >> >> >> >> >Can you share BucketCache related config parameters in your
>> >> cluster
>> >> >> ?
>> >> >> >> >> >
>> >> >> >> >> >Cheers
>> >> >> >> >> >
>> >> >> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <
>> rfox@connexity.com>
>> >> >> >> wrote:
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
>> >> >> >> nid=0xad84
>> >> >> >> >> >> runnable [0x00007fbcc0c65000]
>> >> >> >> >> >>    java.lang.Thread.State: RUNNABLE
>> >> >> >> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>> >> >> >> >> >>         at
>> java.util.LinkedList.contains(LinkedList.java:315)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>> >> >> >> >> >>         - locked <0x000000041b0887a8> (a
>> >> >> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>> >> >> >> >> >>         - locked <0x00000004944ff2d8> (a
>> >> >> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>> >> >> >> >> >>         at
>> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >> >> >>         at
>> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> >> >> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on
>> condition
>> >> >> >> >> >> [0x00007fbcc5dcc000]
>> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
>> >> >> >> >> >>
>> >> >> >>
>> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>> >> >> >> >> >>         at
>> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >> >> >>         at
>> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10
>> tid=0x00007fcec0142000
>> >> >> >> nid=0x3056
>> >> >> >> >> >> waiting on condition [0x00007fbcc2d87000]
>> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
>> >> >> >> >> >>
>> >> >> >>
>> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>> >> >> >> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >>
>> >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >> >>         at
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> I attached the whole thing as well.
>> >> >> >> >> >>
>> >> >> >> >> >> -r
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com>
>> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> >Can you give a bit more detail on why block eviction was
>> cause
>> >> >> for
>> >> >> >> the
>> >> >> >> >> >> slow region movement?
>> >> >> >> >> >> >
>> >> >> >> >> >> >Did you happen to take stack traces ?
>> >> >> >> >> >> >
>> >> >> >> >> >> >Thanks
>> >> >> >> >> >> >
>> >> >> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <
>> rfox@connexity.com
>> >> >
>> >> >> >> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Hi,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that
>> >> >> region
>> >> >> >> >> moves
>> >> >> >> >> >> are super slow (order of minutes) whereas previously they
>> >> where in
>> >> >> >> the
>> >> >> >> >> >> seconds range.  After looking at the code, I think the time
>> is
>> >> >> spent
>> >> >> >> >> >> waiting for the blocks to be evicted from block cache.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I wanted to verify that this theory is correct and see if
>> >> there
>> >> >> >> is
>> >> >> >> >> >> anything that can be done to speed up the moves.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> This is particular painful as we are trying to get our
>> >> configs
>> >> >> >> tuned
>> >> >> >> >> to
>> >> >> >> >> >> the new SW and need to do rolling restarts which is taking
>> >> almost
>> >> >> 24
>> >> >> >> >> hours
>> >> >> >> >> >> on our cluster.  We also do our own manual rebalancing of
>> >> regions
>> >> >> >> across
>> >> >> >> >> >> RS’s and that task is also now painful.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Thanks,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Randy
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >>
>>

Re: Slow region moves

Posted by Vladimir Rodionov <vl...@gmail.com>.

Randy,

You can try patch I just submitted. It is for master but I verified it on
1.0 branch as well.

-Vlad

On Wed, Oct 21, 2015 at 11:40 AM, Randy Fox <rf...@connexity.com> wrote:

> https://issues.apache.org/jira/browse/HBASE-14663
>
> -r
>
>
>
> On 10/21/15, 10:35 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>
> >You are right, Randy
> >
> >This is the bug. Will you open JIRA?
> >
> >-Vlad
> >
> >On Wed, Oct 21, 2015 at 9:35 AM, Randy Fox <rf...@connexity.com> wrote:
> >
> >> Maybe I am looking in the wrong place but Hstore::close() has the
> >> evictOnClose parameter hard coded to true:
> >>
> >> // close each store file in parallel
> >> CompletionService<Void> completionService =
> >>   new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
> >> for (final StoreFile f : result) {
> >>   completionService.submit(new Callable<Void>() {
> >>     @Override
> >>     public Void call() throws IOException {
> >>       f.closeReader(true);
> >>       return null;
> >>     }
> >>   });
> >> }
> >>
> >>
> >> Where does that setting come into play?
> >>
> >> -r
> >>
> >>
> >>
> >>
> >>
> >> On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com>
> wrote:
> >>
> >> >I wonder why disabling cache eviction on close does not work in a case
> of
> >> a
> >> >bucket cache? I checked the code and did not find
> >> >anything suspicious.  It has to work.
> >> >
> >> >On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
> >> >ramkrishna.s.vasudevan@gmail.com> wrote:
> >> >
> >> >> Seems that the BucketAllocator#freeBlock() is synchronized and hence
> all
> >> >> the bulk close that it tries to do will be blocked in the
> synchronized
> >> >> block.  May be something like the IdLock has to be tried here?
> >> >>
> >> >> Regards
> >> >> Ram
> >> >>
> >> >> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
> >> >> ramkrishna.s.vasudevan@gmail.com> wrote:
> >> >>
> >> >> > I think the forceful clearing of the blocks from the bucket cache
> is
> >> >> > hurting in this case.  I think it is worth opening a JIRA for this
> and
> >> >> work
> >> >> > on a fix.
> >> >> >
> >> >> > Regards
> >> >> > Ram
> >> >> >
> >> >> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com>
> >> wrote:
> >> >> >
> >> >> >> Hi Vlad,
> >> >> >>
> >> >> >> I tried it on a table and on a RegionServer basis and it appears
> to
> >> have
> >> >> >> no affect.
> >> >> >> Are we sure it is supported for bucket cache?  From my charts the
> >> bucket
> >> >> >> cache is getting cleared at the same time as the region moves
> >> occurred.
> >> >> >> The regions slow to move are the ones with bucket cache.
> >> >> >>
> >> >> >> I took a table with 102 regions and blockcache true and turned off
> >> block
> >> >> >> cache via alter while the table is enabled - it took 19 minutes.
> To
> >> >> turn
> >> >> >> block cache back on took 4.3 seconds.
> >> >> >>
> >> >> >> Let me know if there is anything else to try.  This issue is
> really
> >> >> >> hurting our day to day ops.
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> Randy
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vladrodionov@gmail.com
> >
> >> >> wrote:
> >> >> >>
> >> >> >> >Hey, Randy
> >> >> >> >
> >> >> >> >You can verify your hypothesis by setting
> >> hbase.rs.evictblocksonclose
> >> >> to
> >> >> >> >false for your tables.
> >> >> >> >
> >> >> >> >-Vlad
> >> >> >> >
> >> >> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com>
> >> wrote:
> >> >> >> >
> >> >> >> >> Caveat - we are trying to tune the BucketCache (probably a new
> >> thread
> >> >> >> - as
> >> >> >> >> we are not sure we are getting the most out of it)
> >> >> >> >> 72G off heap
> >> >> >> >>
> >> >> >> >> <property>
> >> >> >> >>    <name>hfile.block.cache.size</name>
> >> >> >> >>    <value>0.58</value>
> >> >> >> >> </property>
> >> >> >> >>
> >> >> >> >> <property>
> >> >> >> >>    <name>hbase.bucketcache.ioengine</name>
> >> >> >> >>    <value>offheap</value>
> >> >> >> >> </property>
> >> >> >> >>
> >> >> >> >> <property>
> >> >> >> >>    <name>hbase.bucketcache.size</name>
> >> >> >> >>    <value>72800</value>
> >> >> >> >> </property>
> >> >> >> >>
> >> >> >> >> <property>
> >> >> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
> >> >> >> >>    <value>9216,17408,33792,66560</value>
> >> >> >> >> </property>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> >I am a bit curious.
> >> >> >> >> >0.94 doesn't have BucketCache.
> >> >> >> >> >
> >> >> >> >> >Can you share BucketCache related config parameters in your
> >> cluster
> >> >> ?
> >> >> >> >> >
> >> >> >> >> >Cheers
> >> >> >> >> >
> >> >> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <
> rfox@connexity.com>
> >> >> >> wrote:
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
> >> >> >> nid=0xad84
> >> >> >> >> >> runnable [0x00007fbcc0c65000]
> >> >> >> >> >>    java.lang.Thread.State: RUNNABLE
> >> >> >> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
> >> >> >> >> >>         at
> java.util.LinkedList.contains(LinkedList.java:315)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
> >> >> >> >> >>         - locked <0x000000041b0887a8> (a
> >> >> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
> >> >> >> >> >>         - locked <0x00000004944ff2d8> (a
> >> >> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
> >> >> >> >> >>         at
> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >> >> >>         at
> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> >> >> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on
> condition
> >> >> >> >> >> [0x00007fbcc5dcc000]
> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
> >> >> >> >> >>
> >> >> >>
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
> >> >> >> >> >>         at
> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >> >> >>         at
> >> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10
> tid=0x00007fcec0142000
> >> >> >> nid=0x3056
> >> >> >> >> >> waiting on condition [0x00007fbcc2d87000]
> >> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
> >> >> >> >> >>
> >> >> >>
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
> >> >> >> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >>
> >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >>         at
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> I attached the whole thing as well.
> >> >> >> >> >>
> >> >> >> >> >> -r
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com>
> wrote:
> >> >> >> >> >>
> >> >> >> >> >> >Can you give a bit more detail on why block eviction was
> cause
> >> >> for
> >> >> >> the
> >> >> >> >> >> slow region movement?
> >> >> >> >> >> >
> >> >> >> >> >> >Did you happen to take stack traces ?
> >> >> >> >> >> >
> >> >> >> >> >> >Thanks
> >> >> >> >> >> >
> >> >> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <
> rfox@connexity.com
> >> >
> >> >> >> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hi,
> >> >> >> >> >> >>
> >> >> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that
> >> >> region
> >> >> >> >> moves
> >> >> >> >> >> are super slow (order of minutes) whereas previously they
> >> where in
> >> >> >> the
> >> >> >> >> >> seconds range.  After looking at the code, I think the time
> is
> >> >> spent
> >> >> >> >> >> waiting for the blocks to be evicted from block cache.
> >> >> >> >> >> >>
> >> >> >> >> >> >> I wanted to verify that this theory is correct and see if
> >> there
> >> >> >> is
> >> >> >> >> >> anything that can be done to speed up the moves.
> >> >> >> >> >> >>
> >> >> >> >> >> >> This is particular painful as we are trying to get our
> >> configs
> >> >> >> tuned
> >> >> >> >> to
> >> >> >> >> >> the new SW and need to do rolling restarts which is taking
> >> almost
> >> >> 24
> >> >> >> >> hours
> >> >> >> >> >> on our cluster.  We also do our own manual rebalancing of
> >> regions
> >> >> >> across
> >> >> >> >> >> RS’s and that task is also now painful.
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> Thanks,
> >> >> >> >> >> >>
> >> >> >> >> >> >> Randy
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >>
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

https://issues.apache.org/jira/browse/HBASE-14663

-r



On 10/21/15, 10:35 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:

>You are right, Randy
>
>This is the bug. Will you open JIRA?
>
>-Vlad
>
>On Wed, Oct 21, 2015 at 9:35 AM, Randy Fox <rf...@connexity.com> wrote:
>
>> Maybe I am looking in the wrong place but Hstore::close() has the
>> evictOnClose parameter hard coded to true:
>>
>> // close each store file in parallel
>> CompletionService<Void> completionService =
>>   new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
>> for (final StoreFile f : result) {
>>   completionService.submit(new Callable<Void>() {
>>     @Override
>>     public Void call() throws IOException {
>>       f.closeReader(true);
>>       return null;
>>     }
>>   });
>> }
>>
>>
>> Where does that setting come into play?
>>
>> -r
>>
>>
>>
>>
>>
>> On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>>
>> >I wonder why disabling cache eviction on close does not work in a case of
>> a
>> >bucket cache? I checked the code and did not find
>> >anything suspicious.  It has to work.
>> >
>> >On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
>> >ramkrishna.s.vasudevan@gmail.com> wrote:
>> >
>> >> Seems that the BucketAllocator#freeBlock() is synchronized and hence all
>> >> the bulk close that it tries to do will be blocked in the synchronized
>> >> block.  May be something like the IdLock has to be tried here?
>> >>
>> >> Regards
>> >> Ram
>> >>
>> >> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
>> >> ramkrishna.s.vasudevan@gmail.com> wrote:
>> >>
>> >> > I think the forceful clearing of the blocks from the bucket cache is
>> >> > hurting in this case.  I think it is worth opening a JIRA for this and
>> >> work
>> >> > on a fix.
>> >> >
>> >> > Regards
>> >> > Ram
>> >> >
>> >> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com>
>> wrote:
>> >> >
>> >> >> Hi Vlad,
>> >> >>
>> >> >> I tried it on a table and on a RegionServer basis and it appears to
>> have
>> >> >> no affect.
>> >> >> Are we sure it is supported for bucket cache?  From my charts the
>> bucket
>> >> >> cache is getting cleared at the same time as the region moves
>> occurred.
>> >> >> The regions slow to move are the ones with bucket cache.
>> >> >>
>> >> >> I took a table with 102 regions and blockcache true and turned off
>> block
>> >> >> cache via alter while the table is enabled - it took 19 minutes.  To
>> >> turn
>> >> >> block cache back on took 4.3 seconds.
>> >> >>
>> >> >> Let me know if there is anything else to try.  This issue is really
>> >> >> hurting our day to day ops.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Randy
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com>
>> >> wrote:
>> >> >>
>> >> >> >Hey, Randy
>> >> >> >
>> >> >> >You can verify your hypothesis by setting
>> hbase.rs.evictblocksonclose
>> >> to
>> >> >> >false for your tables.
>> >> >> >
>> >> >> >-Vlad
>> >> >> >
>> >> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com>
>> wrote:
>> >> >> >
>> >> >> >> Caveat - we are trying to tune the BucketCache (probably a new
>> thread
>> >> >> - as
>> >> >> >> we are not sure we are getting the most out of it)
>> >> >> >> 72G off heap
>> >> >> >>
>> >> >> >> <property>
>> >> >> >>    <name>hfile.block.cache.size</name>
>> >> >> >>    <value>0.58</value>
>> >> >> >> </property>
>> >> >> >>
>> >> >> >> <property>
>> >> >> >>    <name>hbase.bucketcache.ioengine</name>
>> >> >> >>    <value>offheap</value>
>> >> >> >> </property>
>> >> >> >>
>> >> >> >> <property>
>> >> >> >>    <name>hbase.bucketcache.size</name>
>> >> >> >>    <value>72800</value>
>> >> >> >> </property>
>> >> >> >>
>> >> >> >> <property>
>> >> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
>> >> >> >>    <value>9216,17408,33792,66560</value>
>> >> >> >> </property>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> >I am a bit curious.
>> >> >> >> >0.94 doesn't have BucketCache.
>> >> >> >> >
>> >> >> >> >Can you share BucketCache related config parameters in your
>> cluster
>> >> ?
>> >> >> >> >
>> >> >> >> >Cheers
>> >> >> >> >
>> >> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com>
>> >> >> wrote:
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
>> >> >> nid=0xad84
>> >> >> >> >> runnable [0x00007fbcc0c65000]
>> >> >> >> >>    java.lang.Thread.State: RUNNABLE
>> >> >> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>> >> >> >> >>         at java.util.LinkedList.contains(LinkedList.java:315)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>> >> >> >> >>         - locked <0x000000041b0887a8> (a
>> >> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>> >> >> >> >>         - locked <0x00000004944ff2d8> (a
>> >> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>> >> >> >> >>         at
>> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >> >>         at
>> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> >> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
>> >> >> >> >> [0x00007fbcc5dcc000]
>> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >> >>         at
>> >> >> >> >>
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >> >>         at
>> >> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>> >> >> >> >>         at
>> >> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>> >> >> >> >>         at
>> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >> >>         at
>> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000
>> >> >> nid=0x3056
>> >> >> >> >> waiting on condition [0x00007fbcc2d87000]
>> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
>> >> >> >> >>
>> >> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >> >>         at
>> >> >> >> >>
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>> >> >> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >>
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >>         at
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> I attached the whole thing as well.
>> >> >> >> >>
>> >> >> >> >> -r
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
>> >> >> >> >>
>> >> >> >> >> >Can you give a bit more detail on why block eviction was cause
>> >> for
>> >> >> the
>> >> >> >> >> slow region movement?
>> >> >> >> >> >
>> >> >> >> >> >Did you happen to take stack traces ?
>> >> >> >> >> >
>> >> >> >> >> >Thanks
>> >> >> >> >> >
>> >> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rfox@connexity.com
>> >
>> >> >> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Hi,
>> >> >> >> >> >>
>> >> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that
>> >> region
>> >> >> >> moves
>> >> >> >> >> are super slow (order of minutes) whereas previously they
>> where in
>> >> >> the
>> >> >> >> >> seconds range.  After looking at the code, I think the time is
>> >> spent
>> >> >> >> >> waiting for the blocks to be evicted from block cache.
>> >> >> >> >> >>
>> >> >> >> >> >> I wanted to verify that this theory is correct and see if
>> there
>> >> >> is
>> >> >> >> >> anything that can be done to speed up the moves.
>> >> >> >> >> >>
>> >> >> >> >> >> This is particular painful as we are trying to get our
>> configs
>> >> >> tuned
>> >> >> >> to
>> >> >> >> >> the new SW and need to do rolling restarts which is taking
>> almost
>> >> 24
>> >> >> >> hours
>> >> >> >> >> on our cluster.  We also do our own manual rebalancing of
>> regions
>> >> >> across
>> >> >> >> >> RS’s and that task is also now painful.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Thanks,
>> >> >> >> >> >>
>> >> >> >> >> >> Randy
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>>

Re: Slow region moves

Posted by Vladimir Rodionov <vl...@gmail.com>.

You are right, Randy

This is the bug. Will you open JIRA?

-Vlad

On Wed, Oct 21, 2015 at 9:35 AM, Randy Fox <rf...@connexity.com> wrote:

> Maybe I am looking in the wrong place but Hstore::close() has the
> evictOnClose parameter hard coded to true:
>
> // close each store file in parallel
> CompletionService<Void> completionService =
>   new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
> for (final StoreFile f : result) {
>   completionService.submit(new Callable<Void>() {
>     @Override
>     public Void call() throws IOException {
>       f.closeReader(true);
>       return null;
>     }
>   });
> }
>
>
> Where does that setting come into play?
>
> -r
>
>
>
>
>
> On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>
> >I wonder why disabling cache eviction on close does not work in a case of
> a
> >bucket cache? I checked the code and did not find
> >anything suspicious.  It has to work.
> >
> >On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
> >ramkrishna.s.vasudevan@gmail.com> wrote:
> >
> >> Seems that the BucketAllocator#freeBlock() is synchronized and hence all
> >> the bulk close that it tries to do will be blocked in the synchronized
> >> block.  May be something like the IdLock has to be tried here?
> >>
> >> Regards
> >> Ram
> >>
> >> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
> >> ramkrishna.s.vasudevan@gmail.com> wrote:
> >>
> >> > I think the forceful clearing of the blocks from the bucket cache is
> >> > hurting in this case.  I think it is worth opening a JIRA for this and
> >> work
> >> > on a fix.
> >> >
> >> > Regards
> >> > Ram
> >> >
> >> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com>
> wrote:
> >> >
> >> >> Hi Vlad,
> >> >>
> >> >> I tried it on a table and on a RegionServer basis and it appears to
> have
> >> >> no affect.
> >> >> Are we sure it is supported for bucket cache?  From my charts the
> bucket
> >> >> cache is getting cleared at the same time as the region moves
> occurred.
> >> >> The regions slow to move are the ones with bucket cache.
> >> >>
> >> >> I took a table with 102 regions and blockcache true and turned off
> block
> >> >> cache via alter while the table is enabled - it took 19 minutes.  To
> >> turn
> >> >> block cache back on took 4.3 seconds.
> >> >>
> >> >> Let me know if there is anything else to try.  This issue is really
> >> >> hurting our day to day ops.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Randy
> >> >>
> >> >>
> >> >>
> >> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com>
> >> wrote:
> >> >>
> >> >> >Hey, Randy
> >> >> >
> >> >> >You can verify your hypothesis by setting
> hbase.rs.evictblocksonclose
> >> to
> >> >> >false for your tables.
> >> >> >
> >> >> >-Vlad
> >> >> >
> >> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com>
> wrote:
> >> >> >
> >> >> >> Caveat - we are trying to tune the BucketCache (probably a new
> thread
> >> >> - as
> >> >> >> we are not sure we are getting the most out of it)
> >> >> >> 72G off heap
> >> >> >>
> >> >> >> <property>
> >> >> >>    <name>hfile.block.cache.size</name>
> >> >> >>    <value>0.58</value>
> >> >> >> </property>
> >> >> >>
> >> >> >> <property>
> >> >> >>    <name>hbase.bucketcache.ioengine</name>
> >> >> >>    <value>offheap</value>
> >> >> >> </property>
> >> >> >>
> >> >> >> <property>
> >> >> >>    <name>hbase.bucketcache.size</name>
> >> >> >>    <value>72800</value>
> >> >> >> </property>
> >> >> >>
> >> >> >> <property>
> >> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
> >> >> >>    <value>9216,17408,33792,66560</value>
> >> >> >> </property>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >> >> >>
> >> >> >> >I am a bit curious.
> >> >> >> >0.94 doesn't have BucketCache.
> >> >> >> >
> >> >> >> >Can you share BucketCache related config parameters in your
> cluster
> >> ?
> >> >> >> >
> >> >> >> >Cheers
> >> >> >> >
> >> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com>
> >> >> wrote:
> >> >> >> >
> >> >> >> >>
> >> >> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
> >> >> nid=0xad84
> >> >> >> >> runnable [0x00007fbcc0c65000]
> >> >> >> >>    java.lang.Thread.State: RUNNABLE
> >> >> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
> >> >> >> >>         at java.util.LinkedList.contains(LinkedList.java:315)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
> >> >> >> >>         - locked <0x000000041b0887a8> (a
> >> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
> >> >> >> >>         - locked <0x00000004944ff2d8> (a
> >> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
> >> >> >> >>         at
> >> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
> >> >> >> >>         at
> >> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
> >> >> >> >>         at
> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >> >>         at
> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> >> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
> >> >> >> >> [0x00007fbcc5dcc000]
> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
> >> >> >> >>
> >> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >> >>         at
> >> >> >> >>
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >> >>         at
> >> >> >> >>
> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
> >> >> >> >>         at
> >> >> >> >>
> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
> >> >> >> >>         at
> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >> >>         at
> >> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000
> >> >> nid=0x3056
> >> >> >> >> waiting on condition [0x00007fbcc2d87000]
> >> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
> >> >> >> >>
> >> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >> >>         at
> >> >> >> >>
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
> >> >> >> >>         at
> >> >> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
> >> >> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> I attached the whole thing as well.
> >> >> >> >>
> >> >> >> >> -r
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> >Can you give a bit more detail on why block eviction was cause
> >> for
> >> >> the
> >> >> >> >> slow region movement?
> >> >> >> >> >
> >> >> >> >> >Did you happen to take stack traces ?
> >> >> >> >> >
> >> >> >> >> >Thanks
> >> >> >> >> >
> >> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rfox@connexity.com
> >
> >> >> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Hi,
> >> >> >> >> >>
> >> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that
> >> region
> >> >> >> moves
> >> >> >> >> are super slow (order of minutes) whereas previously they
> where in
> >> >> the
> >> >> >> >> seconds range.  After looking at the code, I think the time is
> >> spent
> >> >> >> >> waiting for the blocks to be evicted from block cache.
> >> >> >> >> >>
> >> >> >> >> >> I wanted to verify that this theory is correct and see if
> there
> >> >> is
> >> >> >> >> anything that can be done to speed up the moves.
> >> >> >> >> >>
> >> >> >> >> >> This is particular painful as we are trying to get our
> configs
> >> >> tuned
> >> >> >> to
> >> >> >> >> the new SW and need to do rolling restarts which is taking
> almost
> >> 24
> >> >> >> hours
> >> >> >> >> on our cluster.  We also do our own manual rebalancing of
> regions
> >> >> across
> >> >> >> >> RS’s and that task is also now painful.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Thanks,
> >> >> >> >> >>
> >> >> >> >> >> Randy
> >> >> >> >>
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

Maybe I am looking in the wrong place but Hstore::close() has the evictOnClose parameter hard coded to true:

// close each store file in parallel
CompletionService<Void> completionService =
  new ExecutorCompletionService<Void>(storeFileCloserThreadPool);
for (final StoreFile f : result) {
  completionService.submit(new Callable<Void>() {
    @Override
    public Void call() throws IOException {
      f.closeReader(true);
      return null;
    }
  });
}


Where does that setting come into play?

-r





On 10/21/15, 8:14 AM, "Vladimir Rodionov" <vl...@gmail.com> wrote:

>I wonder why disabling cache eviction on close does not work in a case of a
>bucket cache? I checked the code and did not find
>anything suspicious.  It has to work.
>
>On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
>ramkrishna.s.vasudevan@gmail.com> wrote:
>
>> Seems that the BucketAllocator#freeBlock() is synchronized and hence all
>> the bulk close that it tries to do will be blocked in the synchronized
>> block.  May be something like the IdLock has to be tried here?
>>
>> Regards
>> Ram
>>
>> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
>> ramkrishna.s.vasudevan@gmail.com> wrote:
>>
>> > I think the forceful clearing of the blocks from the bucket cache is
>> > hurting in this case.  I think it is worth opening a JIRA for this and
>> work
>> > on a fix.
>> >
>> > Regards
>> > Ram
>> >
>> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com> wrote:
>> >
>> >> Hi Vlad,
>> >>
>> >> I tried it on a table and on a RegionServer basis and it appears to have
>> >> no affect.
>> >> Are we sure it is supported for bucket cache?  From my charts the bucket
>> >> cache is getting cleared at the same time as the region moves occurred.
>> >> The regions slow to move are the ones with bucket cache.
>> >>
>> >> I took a table with 102 regions and blockcache true and turned off block
>> >> cache via alter while the table is enabled - it took 19 minutes.  To
>> turn
>> >> block cache back on took 4.3 seconds.
>> >>
>> >> Let me know if there is anything else to try.  This issue is really
>> >> hurting our day to day ops.
>> >>
>> >> Thanks,
>> >>
>> >> Randy
>> >>
>> >>
>> >>
>> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com>
>> wrote:
>> >>
>> >> >Hey, Randy
>> >> >
>> >> >You can verify your hypothesis by setting hbase.rs.evictblocksonclose
>> to
>> >> >false for your tables.
>> >> >
>> >> >-Vlad
>> >> >
>> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com> wrote:
>> >> >
>> >> >> Caveat - we are trying to tune the BucketCache (probably a new thread
>> >> - as
>> >> >> we are not sure we are getting the most out of it)
>> >> >> 72G off heap
>> >> >>
>> >> >> <property>
>> >> >>    <name>hfile.block.cache.size</name>
>> >> >>    <value>0.58</value>
>> >> >> </property>
>> >> >>
>> >> >> <property>
>> >> >>    <name>hbase.bucketcache.ioengine</name>
>> >> >>    <value>offheap</value>
>> >> >> </property>
>> >> >>
>> >> >> <property>
>> >> >>    <name>hbase.bucketcache.size</name>
>> >> >>    <value>72800</value>
>> >> >> </property>
>> >> >>
>> >> >> <property>
>> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
>> >> >>    <value>9216,17408,33792,66560</value>
>> >> >> </property>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> >> >>
>> >> >> >I am a bit curious.
>> >> >> >0.94 doesn't have BucketCache.
>> >> >> >
>> >> >> >Can you share BucketCache related config parameters in your cluster
>> ?
>> >> >> >
>> >> >> >Cheers
>> >> >> >
>> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com>
>> >> wrote:
>> >> >> >
>> >> >> >>
>> >> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
>> >> nid=0xad84
>> >> >> >> runnable [0x00007fbcc0c65000]
>> >> >> >>    java.lang.Thread.State: RUNNABLE
>> >> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>> >> >> >>         at java.util.LinkedList.contains(LinkedList.java:315)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>> >> >> >>         - locked <0x000000041b0887a8> (a
>> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>> >> >> >>         - locked <0x00000004944ff2d8> (a
>> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
>> >> >> >>         at
>> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>> >> >> >>         at
>> >> >> >>
>> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>> >> >> >>         at
>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >>         at
>> >> >> >>
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >>         at
>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
>> >> >> >> [0x00007fbcc5dcc000]
>> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
>> >> >> >>
>> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >>         at
>> >> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >>         at
>> >> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>> >> >> >>         at
>> >> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>> >> >> >>         at
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>> >> >> >>         at
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>> >> >> >>         at
>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >>         at
>> >> >> >>
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >> >>         at
>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >>
>> >> >> >>
>> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000
>> >> nid=0x3056
>> >> >> >> waiting on condition [0x00007fbcc2d87000]
>> >> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
>> >> >> >>
>> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >> >>         at
>> >> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >> >>         at
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>> >> >> >>         at
>> >> >> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>> >> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >> >> >>         at
>> >> >> >>
>> >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >>         at
>> >> >> >>
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >> >>
>> >> >> >>
>> >> >> >> I attached the whole thing as well.
>> >> >> >>
>> >> >> >> -r
>> >> >> >>
>> >> >> >>
>> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> >Can you give a bit more detail on why block eviction was cause
>> for
>> >> the
>> >> >> >> slow region movement?
>> >> >> >> >
>> >> >> >> >Did you happen to take stack traces ?
>> >> >> >> >
>> >> >> >> >Thanks
>> >> >> >> >
>> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com>
>> >> wrote:
>> >> >> >> >>
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that
>> region
>> >> >> moves
>> >> >> >> are super slow (order of minutes) whereas previously they where in
>> >> the
>> >> >> >> seconds range.  After looking at the code, I think the time is
>> spent
>> >> >> >> waiting for the blocks to be evicted from block cache.
>> >> >> >> >>
>> >> >> >> >> I wanted to verify that this theory is correct and see if there
>> >> is
>> >> >> >> anything that can be done to speed up the moves.
>> >> >> >> >>
>> >> >> >> >> This is particular painful as we are trying to get our configs
>> >> tuned
>> >> >> to
>> >> >> >> the new SW and need to do rolling restarts which is taking almost
>> 24
>> >> >> hours
>> >> >> >> on our cluster.  We also do our own manual rebalancing of regions
>> >> across
>> >> >> >> RS’s and that task is also now painful.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Thanks,
>> >> >> >> >>
>> >> >> >> >> Randy
>> >> >> >>
>> >> >>
>> >>
>> >
>> >
>>

Re: Slow region moves

Posted by Vladimir Rodionov <vl...@gmail.com>.

I wonder why disabling cache eviction on close does not work in a case of a
bucket cache? I checked the code and did not find
anything suspicious.  It has to work.

On Wed, Oct 21, 2015 at 3:52 AM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> Seems that the BucketAllocator#freeBlock() is synchronized and hence all
> the bulk close that it tries to do will be blocked in the synchronized
> block.  May be something like the IdLock has to be tried here?
>
> Regards
> Ram
>
> On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
> > I think the forceful clearing of the blocks from the bucket cache is
> > hurting in this case.  I think it is worth opening a JIRA for this and
> work
> > on a fix.
> >
> > Regards
> > Ram
> >
> > On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com> wrote:
> >
> >> Hi Vlad,
> >>
> >> I tried it on a table and on a RegionServer basis and it appears to have
> >> no affect.
> >> Are we sure it is supported for bucket cache?  From my charts the bucket
> >> cache is getting cleared at the same time as the region moves occurred.
> >> The regions slow to move are the ones with bucket cache.
> >>
> >> I took a table with 102 regions and blockcache true and turned off block
> >> cache via alter while the table is enabled - it took 19 minutes.  To
> turn
> >> block cache back on took 4.3 seconds.
> >>
> >> Let me know if there is anything else to try.  This issue is really
> >> hurting our day to day ops.
> >>
> >> Thanks,
> >>
> >> Randy
> >>
> >>
> >>
> >> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com>
> wrote:
> >>
> >> >Hey, Randy
> >> >
> >> >You can verify your hypothesis by setting hbase.rs.evictblocksonclose
> to
> >> >false for your tables.
> >> >
> >> >-Vlad
> >> >
> >> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com> wrote:
> >> >
> >> >> Caveat - we are trying to tune the BucketCache (probably a new thread
> >> - as
> >> >> we are not sure we are getting the most out of it)
> >> >> 72G off heap
> >> >>
> >> >> <property>
> >> >>    <name>hfile.block.cache.size</name>
> >> >>    <value>0.58</value>
> >> >> </property>
> >> >>
> >> >> <property>
> >> >>    <name>hbase.bucketcache.ioengine</name>
> >> >>    <value>offheap</value>
> >> >> </property>
> >> >>
> >> >> <property>
> >> >>    <name>hbase.bucketcache.size</name>
> >> >>    <value>72800</value>
> >> >> </property>
> >> >>
> >> >> <property>
> >> >>    <name>hbase.bucketcache.bucket.sizes</name>
> >> >>    <value>9216,17408,33792,66560</value>
> >> >> </property>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >> >>
> >> >> >I am a bit curious.
> >> >> >0.94 doesn't have BucketCache.
> >> >> >
> >> >> >Can you share BucketCache related config parameters in your cluster
> ?
> >> >> >
> >> >> >Cheers
> >> >> >
> >> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com>
> >> wrote:
> >> >> >
> >> >> >>
> >> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
> >> nid=0xad84
> >> >> >> runnable [0x00007fbcc0c65000]
> >> >> >>    java.lang.Thread.State: RUNNABLE
> >> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
> >> >> >>         at java.util.LinkedList.contains(LinkedList.java:315)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
> >> >> >>         - locked <0x000000041b0887a8> (a
> >> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
> >> >> >>         - locked <0x00000004944ff2d8> (a
> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
> >> >> >>         at
> >> >> >>
> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
> >> >> >>         at
> >> >> >>
> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
> >> >> >>         at
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >>         at
> >> >> >>
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >>         at
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >>
> >> >> >>
> >> >>
> >>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> >> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
> >> >> >> [0x00007fbcc5dcc000]
> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >>         - parking to wait for  <0x0000000534e90a80> (a
> >> >> >>
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >>         at
> >> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >>         at
> >> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
> >> >> >>         at
> >> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
> >> >> >>         at
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
> >> >> >>         at
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
> >> >> >>         at
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >>         at
> >> >> >>
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >> >>         at
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >>
> >> >> >>
> >> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000
> >> nid=0x3056
> >> >> >> waiting on condition [0x00007fbcc2d87000]
> >> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >> >>         - parking to wait for  <0x0000000534e61360> (a
> >> >> >>
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >> >>         at
> >> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >> >>         at
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
> >> >> >>         at
> >> >> >>
> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
> >> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >> >> >>         at
> >> >> >>
> >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >>         at
> >> >> >>
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >> >>
> >> >> >>
> >> >> >> I attached the whole thing as well.
> >> >> >>
> >> >> >> -r
> >> >> >>
> >> >> >>
> >> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
> >> >> >>
> >> >> >> >Can you give a bit more detail on why block eviction was cause
> for
> >> the
> >> >> >> slow region movement?
> >> >> >> >
> >> >> >> >Did you happen to take stack traces ?
> >> >> >> >
> >> >> >> >Thanks
> >> >> >> >
> >> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com>
> >> wrote:
> >> >> >> >>
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that
> region
> >> >> moves
> >> >> >> are super slow (order of minutes) whereas previously they where in
> >> the
> >> >> >> seconds range.  After looking at the code, I think the time is
> spent
> >> >> >> waiting for the blocks to be evicted from block cache.
> >> >> >> >>
> >> >> >> >> I wanted to verify that this theory is correct and see if there
> >> is
> >> >> >> anything that can be done to speed up the moves.
> >> >> >> >>
> >> >> >> >> This is particular painful as we are trying to get our configs
> >> tuned
> >> >> to
> >> >> >> the new SW and need to do rolling restarts which is taking almost
> 24
> >> >> hours
> >> >> >> on our cluster.  We also do our own manual rebalancing of regions
> >> across
> >> >> >> RS’s and that task is also now painful.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >> Randy
> >> >> >>
> >> >>
> >>
> >
> >
>

Re: Slow region moves

Posted by ramkrishna vasudevan <ra...@gmail.com>.

Seems that the BucketAllocator#freeBlock() is synchronized and hence all
the bulk close that it tries to do will be blocked in the synchronized
block.  May be something like the IdLock has to be tried here?

Regards
Ram

On Wed, Oct 21, 2015 at 4:20 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> I think the forceful clearing of the blocks from the bucket cache is
> hurting in this case.  I think it is worth opening a JIRA for this and work
> on a fix.
>
> Regards
> Ram
>
> On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com> wrote:
>
>> Hi Vlad,
>>
>> I tried it on a table and on a RegionServer basis and it appears to have
>> no affect.
>> Are we sure it is supported for bucket cache?  From my charts the bucket
>> cache is getting cleared at the same time as the region moves occurred.
>> The regions slow to move are the ones with bucket cache.
>>
>> I took a table with 102 regions and blockcache true and turned off block
>> cache via alter while the table is enabled - it took 19 minutes.  To turn
>> block cache back on took 4.3 seconds.
>>
>> Let me know if there is anything else to try.  This issue is really
>> hurting our day to day ops.
>>
>> Thanks,
>>
>> Randy
>>
>>
>>
>> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>>
>> >Hey, Randy
>> >
>> >You can verify your hypothesis by setting hbase.rs.evictblocksonclose to
>> >false for your tables.
>> >
>> >-Vlad
>> >
>> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com> wrote:
>> >
>> >> Caveat - we are trying to tune the BucketCache (probably a new thread
>> - as
>> >> we are not sure we are getting the most out of it)
>> >> 72G off heap
>> >>
>> >> <property>
>> >>    <name>hfile.block.cache.size</name>
>> >>    <value>0.58</value>
>> >> </property>
>> >>
>> >> <property>
>> >>    <name>hbase.bucketcache.ioengine</name>
>> >>    <value>offheap</value>
>> >> </property>
>> >>
>> >> <property>
>> >>    <name>hbase.bucketcache.size</name>
>> >>    <value>72800</value>
>> >> </property>
>> >>
>> >> <property>
>> >>    <name>hbase.bucketcache.bucket.sizes</name>
>> >>    <value>9216,17408,33792,66560</value>
>> >> </property>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> >>
>> >> >I am a bit curious.
>> >> >0.94 doesn't have BucketCache.
>> >> >
>> >> >Can you share BucketCache related config parameters in your cluster ?
>> >> >
>> >> >Cheers
>> >> >
>> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com>
>> wrote:
>> >> >
>> >> >>
>> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800
>> nid=0xad84
>> >> >> runnable [0x00007fbcc0c65000]
>> >> >>    java.lang.Thread.State: RUNNABLE
>> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>> >> >>         at java.util.LinkedList.contains(LinkedList.java:315)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>> >> >>         - locked <0x000000041b0887a8> (a
>> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>> >> >>         - locked <0x00000004944ff2d8> (a
>> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
>> >> >>         at
>> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>> >> >>         at
>> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >>         at
>> >> >>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >>
>> >> >>
>> >>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
>> >> >> [0x00007fbcc5dcc000]
>> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >>         - parking to wait for  <0x0000000534e90a80> (a
>> >> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >>         at
>> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >>         at
>> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>> >> >>         at
>> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >>         at
>> >> >>
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >>
>> >> >>
>> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000
>> nid=0x3056
>> >> >> waiting on condition [0x00007fbcc2d87000]
>> >> >>    java.lang.Thread.State: WAITING (parking)
>> >> >>         at sun.misc.Unsafe.park(Native Method)
>> >> >>         - parking to wait for  <0x0000000534e61360> (a
>> >> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >> >>         at
>> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
>> >> >>         at
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >> >>         at
>> >> >>
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >>         at
>> >> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >>         at java.lang.Thread.run(Thread.java:745)
>> >> >>
>> >> >>
>> >> >> I attached the whole thing as well.
>> >> >>
>> >> >> -r
>> >> >>
>> >> >>
>> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
>> >> >>
>> >> >> >Can you give a bit more detail on why block eviction was cause for
>> the
>> >> >> slow region movement?
>> >> >> >
>> >> >> >Did you happen to take stack traces ?
>> >> >> >
>> >> >> >Thanks
>> >> >> >
>> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com>
>> wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that region
>> >> moves
>> >> >> are super slow (order of minutes) whereas previously they where in
>> the
>> >> >> seconds range.  After looking at the code, I think the time is spent
>> >> >> waiting for the blocks to be evicted from block cache.
>> >> >> >>
>> >> >> >> I wanted to verify that this theory is correct and see if there
>> is
>> >> >> anything that can be done to speed up the moves.
>> >> >> >>
>> >> >> >> This is particular painful as we are trying to get our configs
>> tuned
>> >> to
>> >> >> the new SW and need to do rolling restarts which is taking almost 24
>> >> hours
>> >> >> on our cluster.  We also do our own manual rebalancing of regions
>> across
>> >> >> RS’s and that task is also now painful.
>> >> >> >>
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >>
>> >> >> >> Randy
>> >> >>
>> >>
>>
>
>

Re: Slow region moves

Posted by ramkrishna vasudevan <ra...@gmail.com>.

I think the forceful clearing of the blocks from the bucket cache is
hurting in this case.  I think it is worth opening a JIRA for this and work
on a fix.

Regards
Ram

On Wed, Oct 21, 2015 at 12:00 AM, Randy Fox <rf...@connexity.com> wrote:

> Hi Vlad,
>
> I tried it on a table and on a RegionServer basis and it appears to have
> no affect.
> Are we sure it is supported for bucket cache?  From my charts the bucket
> cache is getting cleared at the same time as the region moves occurred.
> The regions slow to move are the ones with bucket cache.
>
> I took a table with 102 regions and blockcache true and turned off block
> cache via alter while the table is enabled - it took 19 minutes.  To turn
> block cache back on took 4.3 seconds.
>
> Let me know if there is anything else to try.  This issue is really
> hurting our day to day ops.
>
> Thanks,
>
> Randy
>
>
>
> On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>
> >Hey, Randy
> >
> >You can verify your hypothesis by setting hbase.rs.evictblocksonclose to
> >false for your tables.
> >
> >-Vlad
> >
> >On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com> wrote:
> >
> >> Caveat - we are trying to tune the BucketCache (probably a new thread -
> as
> >> we are not sure we are getting the most out of it)
> >> 72G off heap
> >>
> >> <property>
> >>    <name>hfile.block.cache.size</name>
> >>    <value>0.58</value>
> >> </property>
> >>
> >> <property>
> >>    <name>hbase.bucketcache.ioengine</name>
> >>    <value>offheap</value>
> >> </property>
> >>
> >> <property>
> >>    <name>hbase.bucketcache.size</name>
> >>    <value>72800</value>
> >> </property>
> >>
> >> <property>
> >>    <name>hbase.bucketcache.bucket.sizes</name>
> >>    <value>9216,17408,33792,66560</value>
> >> </property>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >>
> >> >I am a bit curious.
> >> >0.94 doesn't have BucketCache.
> >> >
> >> >Can you share BucketCache related config parameters in your cluster ?
> >> >
> >> >Cheers
> >> >
> >> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com>
> wrote:
> >> >
> >> >>
> >> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800 nid=0xad84
> >> >> runnable [0x00007fbcc0c65000]
> >> >>    java.lang.Thread.State: RUNNABLE
> >> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
> >> >>         at java.util.LinkedList.contains(LinkedList.java:315)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
> >> >>         - locked <0x000000041b0887a8> (a
> >> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
> >> >>         - locked <0x00000004944ff2d8> (a
> >> >> org.apache.hadoop.hbase.regionserver.StoreFile)
> >> >>         at
> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
> >> >>         at
> >> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >>         at
> >> >>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >>
> >> >>
> >>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> >> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
> >> >> [0x00007fbcc5dcc000]
> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >>         - parking to wait for  <0x0000000534e90a80> (a
> >> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >>         at
> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >>         at
> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
> >> >>         at
> >> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
> >> >>         at
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
> >> >>         at
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >>         at
> >> >>
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >>
> >> >>
> >> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000
> nid=0x3056
> >> >> waiting on condition [0x00007fbcc2d87000]
> >> >>    java.lang.Thread.State: WAITING (parking)
> >> >>         at sun.misc.Unsafe.park(Native Method)
> >> >>         - parking to wait for  <0x0000000534e61360> (a
> >> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> >>         at
> >> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >> >>         at
> >> >>
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
> >> >>         at
> >> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
> >> >>         - locked <0x000000042230fa68> (a java.lang.Object)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >> >>         at
> >> >>
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >>         at
> >> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >>         at java.lang.Thread.run(Thread.java:745)
> >> >>
> >> >>
> >> >> I attached the whole thing as well.
> >> >>
> >> >> -r
> >> >>
> >> >>
> >> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
> >> >>
> >> >> >Can you give a bit more detail on why block eviction was cause for
> the
> >> >> slow region movement?
> >> >> >
> >> >> >Did you happen to take stack traces ?
> >> >> >
> >> >> >Thanks
> >> >> >
> >> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com>
> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that region
> >> moves
> >> >> are super slow (order of minutes) whereas previously they where in
> the
> >> >> seconds range.  After looking at the code, I think the time is spent
> >> >> waiting for the blocks to be evicted from block cache.
> >> >> >>
> >> >> >> I wanted to verify that this theory is correct and see if there is
> >> >> anything that can be done to speed up the moves.
> >> >> >>
> >> >> >> This is particular painful as we are trying to get our configs
> tuned
> >> to
> >> >> the new SW and need to do rolling restarts which is taking almost 24
> >> hours
> >> >> on our cluster.  We also do our own manual rebalancing of regions
> across
> >> >> RS’s and that task is also now painful.
> >> >> >>
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> Randy
> >> >>
> >>
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

Hi Vlad,

I tried it on a table and on a RegionServer basis and it appears to have no affect.  
Are we sure it is supported for bucket cache?  From my charts the bucket cache is getting cleared at the same time as the region moves occurred.  The regions slow to move are the ones with bucket cache.  

I took a table with 102 regions and blockcache true and turned off block cache via alter while the table is enabled - it took 19 minutes.  To turn block cache back on took 4.3 seconds.

Let me know if there is anything else to try.  This issue is really hurting our day to day ops.

Thanks,

Randy



On 10/15/15, 3:55 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:

>Hey, Randy
>
>You can verify your hypothesis by setting hbase.rs.evictblocksonclose to
>false for your tables.
>
>-Vlad
>
>On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com> wrote:
>
>> Caveat - we are trying to tune the BucketCache (probably a new thread - as
>> we are not sure we are getting the most out of it)
>> 72G off heap
>>
>> <property>
>>    <name>hfile.block.cache.size</name>
>>    <value>0.58</value>
>> </property>
>>
>> <property>
>>    <name>hbase.bucketcache.ioengine</name>
>>    <value>offheap</value>
>> </property>
>>
>> <property>
>>    <name>hbase.bucketcache.size</name>
>>    <value>72800</value>
>> </property>
>>
>> <property>
>>    <name>hbase.bucketcache.bucket.sizes</name>
>>    <value>9216,17408,33792,66560</value>
>> </property>
>>
>>
>>
>>
>>
>>
>> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>> >I am a bit curious.
>> >0.94 doesn't have BucketCache.
>> >
>> >Can you share BucketCache related config parameters in your cluster ?
>> >
>> >Cheers
>> >
>> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com> wrote:
>> >
>> >>
>> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800 nid=0xad84
>> >> runnable [0x00007fbcc0c65000]
>> >>    java.lang.Thread.State: RUNNABLE
>> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>> >>         at java.util.LinkedList.contains(LinkedList.java:315)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>> >>         - locked <0x000000041b0887a8> (a
>> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>> >>         - locked <0x00000004944ff2d8> (a
>> >> org.apache.hadoop.hbase.regionserver.StoreFile)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >>         at
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >>         at
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>         at
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>         at java.lang.Thread.run(Thread.java:745)
>> >>
>> >>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
>> >> [0x00007fbcc5dcc000]
>> >>    java.lang.Thread.State: WAITING (parking)
>> >>         at sun.misc.Unsafe.park(Native Method)
>> >>         - parking to wait for  <0x0000000534e90a80> (a
>> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >>         at
>> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >>         at
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >>         at
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >>         at
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >>         at
>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >>         at
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>         at
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>         at java.lang.Thread.run(Thread.java:745)
>> >>
>> >>
>> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000 nid=0x3056
>> >> waiting on condition [0x00007fbcc2d87000]
>> >>    java.lang.Thread.State: WAITING (parking)
>> >>         at sun.misc.Unsafe.park(Native Method)
>> >>         - parking to wait for  <0x0000000534e61360> (a
>> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >>         at
>> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >>         at
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> >>         at
>> >>
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> >>         at
>> >>
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>> >>         at
>> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>> >>         - locked <0x000000042230fa68> (a java.lang.Object)
>> >>         at
>> >>
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >>         at
>> >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >>         at
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>         at
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>         at java.lang.Thread.run(Thread.java:745)
>> >>
>> >>
>> >> I attached the whole thing as well.
>> >>
>> >> -r
>> >>
>> >>
>> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
>> >>
>> >> >Can you give a bit more detail on why block eviction was cause for the
>> >> slow region movement?
>> >> >
>> >> >Did you happen to take stack traces ?
>> >> >
>> >> >Thanks
>> >> >
>> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that region
>> moves
>> >> are super slow (order of minutes) whereas previously they where in the
>> >> seconds range.  After looking at the code, I think the time is spent
>> >> waiting for the blocks to be evicted from block cache.
>> >> >>
>> >> >> I wanted to verify that this theory is correct and see if there is
>> >> anything that can be done to speed up the moves.
>> >> >>
>> >> >> This is particular painful as we are trying to get our configs tuned
>> to
>> >> the new SW and need to do rolling restarts which is taking almost 24
>> hours
>> >> on our cluster.  We also do our own manual rebalancing of regions across
>> >> RS’s and that task is also now painful.
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Randy
>> >>
>>

Re: Slow region moves

Posted by Vladimir Rodionov <vl...@gmail.com>.

Hey, Randy

You can verify your hypothesis by setting hbase.rs.evictblocksonclose to
false for your tables.

-Vlad

On Thu, Oct 15, 2015 at 1:06 PM, Randy Fox <rf...@connexity.com> wrote:

> Caveat - we are trying to tune the BucketCache (probably a new thread - as
> we are not sure we are getting the most out of it)
> 72G off heap
>
> <property>
>    <name>hfile.block.cache.size</name>
>    <value>0.58</value>
> </property>
>
> <property>
>    <name>hbase.bucketcache.ioengine</name>
>    <value>offheap</value>
> </property>
>
> <property>
>    <name>hbase.bucketcache.size</name>
>    <value>72800</value>
> </property>
>
> <property>
>    <name>hbase.bucketcache.bucket.sizes</name>
>    <value>9216,17408,33792,66560</value>
> </property>
>
>
>
>
>
>
> On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
> >I am a bit curious.
> >0.94 doesn't have BucketCache.
> >
> >Can you share BucketCache related config parameters in your cluster ?
> >
> >Cheers
> >
> >On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com> wrote:
> >
> >>
> >> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800 nid=0xad84
> >> runnable [0x00007fbcc0c65000]
> >>    java.lang.Thread.State: RUNNABLE
> >>         at java.util.LinkedList.indexOf(LinkedList.java:602)
> >>         at java.util.LinkedList.contains(LinkedList.java:315)
> >>         at
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
> >>         at
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
> >>         - locked <0x000000041b0887a8> (a
> >> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
> >>         at
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
> >>         at
> >>
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
> >>         at
> >>
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
> >>         at
> >>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
> >>         at
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
> >>         at
> >>
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
> >>         - locked <0x00000004944ff2d8> (a
> >> org.apache.hadoop.hbase.regionserver.StoreFile)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >>         at
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >>         at
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>         at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>         at java.lang.Thread.run(Thread.java:745)
> >>
> >>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> >> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
> >> [0x00007fbcc5dcc000]
> >>    java.lang.Thread.State: WAITING (parking)
> >>         at sun.misc.Unsafe.park(Native Method)
> >>         - parking to wait for  <0x0000000534e90a80> (a
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >>         at
> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>         at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >>         at
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >>         at
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >>         at
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >>         at
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>         at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>         at java.lang.Thread.run(Thread.java:745)
> >>
> >>
> >> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000 nid=0x3056
> >> waiting on condition [0x00007fbcc2d87000]
> >>    java.lang.Thread.State: WAITING (parking)
> >>         at sun.misc.Unsafe.park(Native Method)
> >>         - parking to wait for  <0x0000000534e61360> (a
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >>         at
> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>         at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> >>         at
> >>
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> >>         at
> >>
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
> >>         at
> >> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
> >>         - locked <0x000000042230fa68> (a java.lang.Object)
> >>         at
> >>
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >>         at
> >> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >>         at
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>         at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>         at java.lang.Thread.run(Thread.java:745)
> >>
> >>
> >> I attached the whole thing as well.
> >>
> >> -r
> >>
> >>
> >> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
> >>
> >> >Can you give a bit more detail on why block eviction was cause for the
> >> slow region movement?
> >> >
> >> >Did you happen to take stack traces ?
> >> >
> >> >Thanks
> >> >
> >> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> We just upgraded from 0.94 to 1.0.0 and have noticed that region
> moves
> >> are super slow (order of minutes) whereas previously they where in the
> >> seconds range.  After looking at the code, I think the time is spent
> >> waiting for the blocks to be evicted from block cache.
> >> >>
> >> >> I wanted to verify that this theory is correct and see if there is
> >> anything that can be done to speed up the moves.
> >> >>
> >> >> This is particular painful as we are trying to get our configs tuned
> to
> >> the new SW and need to do rolling restarts which is taking almost 24
> hours
> >> on our cluster.  We also do our own manual rebalancing of regions across
> >> RS’s and that task is also now painful.
> >> >>
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Randy
> >>
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

Caveat - we are trying to tune the BucketCache (probably a new thread - as we are not sure we are getting the most out of it)
72G off heap

<property>
   <name>hfile.block.cache.size</name>
   <value>0.58</value>
</property>

<property>
   <name>hbase.bucketcache.ioengine</name>
   <value>offheap</value>
</property>

<property>
   <name>hbase.bucketcache.size</name>
   <value>72800</value>
</property>

<property>
   <name>hbase.bucketcache.bucket.sizes</name>
   <value>9216,17408,33792,66560</value>
</property>






On 10/15/15, 12:00 PM, "Ted Yu" <yu...@gmail.com> wrote:

>I am a bit curious.
>0.94 doesn't have BucketCache.
>
>Can you share BucketCache related config parameters in your cluster ?
>
>Cheers
>
>On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com> wrote:
>
>>
>> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800 nid=0xad84
>> runnable [0x00007fbcc0c65000]
>>    java.lang.Thread.State: RUNNABLE
>>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>>         at java.util.LinkedList.contains(LinkedList.java:315)
>>         at
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>>         at
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>>         - locked <0x000000041b0887a8> (a
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>>         at
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>>         at
>> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>>         at
>> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>>         at
>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>>         at
>> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>>         at
>> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>>         - locked <0x00000004944ff2d8> (a
>> org.apache.hadoop.hbase.regionserver.StoreFile)
>>         at
>> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>>         at
>> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>         at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
>> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
>> [0x00007fbcc5dcc000]
>>    java.lang.Thread.State: WAITING (parking)
>>         at sun.misc.Unsafe.park(Native Method)
>>         - parking to wait for  <0x0000000534e90a80> (a
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>         at
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>         at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>         at
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>         at
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>>         at
>> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>>         at
>> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>>         at
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>>         at
>> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>         at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>>
>> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000 nid=0x3056
>> waiting on condition [0x00007fbcc2d87000]
>>    java.lang.Thread.State: WAITING (parking)
>>         at sun.misc.Unsafe.park(Native Method)
>>         - parking to wait for  <0x0000000534e61360> (a
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>         at
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>         at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>>         at
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>         at
>> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>>         at
>> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>>         at
>> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>>         - locked <0x000000042230fa68> (a java.lang.Object)
>>         at
>> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>>         at
>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I attached the whole thing as well.
>>
>> -r
>>
>>
>> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
>>
>> >Can you give a bit more detail on why block eviction was cause for the
>> slow region movement?
>> >
>> >Did you happen to take stack traces ?
>> >
>> >Thanks
>> >
>> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> We just upgraded from 0.94 to 1.0.0 and have noticed that region moves
>> are super slow (order of minutes) whereas previously they where in the
>> seconds range.  After looking at the code, I think the time is spent
>> waiting for the blocks to be evicted from block cache.
>> >>
>> >> I wanted to verify that this theory is correct and see if there is
>> anything that can be done to speed up the moves.
>> >>
>> >> This is particular painful as we are trying to get our configs tuned to
>> the new SW and need to do rolling restarts which is taking almost 24 hours
>> on our cluster.  We also do our own manual rebalancing of regions across
>> RS’s and that task is also now painful.
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Randy
>>

Re: Slow region moves

Posted by Ted Yu <yu...@gmail.com>.

I am a bit curious.
0.94 doesn't have BucketCache.

Can you share BucketCache related config parameters in your cluster ?

Cheers

On Thu, Oct 15, 2015 at 11:11 AM, Randy Fox <rf...@connexity.com> wrote:

>
> "StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800 nid=0xad84
> runnable [0x00007fbcc0c65000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.LinkedList.indexOf(LinkedList.java:602)
>         at java.util.LinkedList.contains(LinkedList.java:315)
>         at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
>         at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
>         - locked <0x000000041b0887a8> (a
> org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
>         at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
>         at
> org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
>         at
> org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
>         at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
>         at
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
>         at
> org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
>         - locked <0x00000004944ff2d8> (a
> org.apache.hadoop.hbase.regionserver.StoreFile)
>         at
> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
>         at
> org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> "StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1"
> prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition
> [0x00007fbcc5dcc000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000534e90a80> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>         at
> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
>         at
> org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>
> "RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000 nid=0x3056
> waiting on condition [0x00007fbcc2d87000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000534e61360> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at
> java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
>         - locked <0x000000042230fa68> (a java.lang.Object)
>         at
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>
> I attached the whole thing as well.
>
> -r
>
>
> On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:
>
> >Can you give a bit more detail on why block eviction was cause for the
> slow region movement?
> >
> >Did you happen to take stack traces ?
> >
> >Thanks
> >
> >> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
> >>
> >> Hi,
> >>
> >> We just upgraded from 0.94 to 1.0.0 and have noticed that region moves
> are super slow (order of minutes) whereas previously they where in the
> seconds range.  After looking at the code, I think the time is spent
> waiting for the blocks to be evicted from block cache.
> >>
> >> I wanted to verify that this theory is correct and see if there is
> anything that can be done to speed up the moves.
> >>
> >> This is particular painful as we are trying to get our configs tuned to
> the new SW and need to do rolling restarts which is taking almost 24 hours
> on our cluster.  We also do our own manual rebalancing of regions across
> RS’s and that task is also now painful.
> >>
> >>
> >> Thanks,
> >>
> >> Randy
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

"StoreFileCloserThread-L-1" prio=10 tid=0x00000000027ec800 nid=0xad84 runnable [0x00007fbcc0c65000]
   java.lang.Thread.State: RUNNABLE
        at java.util.LinkedList.indexOf(LinkedList.java:602)
        at java.util.LinkedList.contains(LinkedList.java:315)
        at org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator$BucketSizeInfo.freeBlock(BucketAllocator.java:247)
        at org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.freeBlock(BucketAllocator.java:449)
        - locked <0x000000041b0887a8> (a org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator)
        at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlock(BucketCache.java:459)
        at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.evictBlocksByHfileName(BucketCache.java:1036)
        at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.evictBlocksByHfileName(CombinedBlockCache.java:90)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.close(HFileReaderV2.java:516)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.close(StoreFile.java:1143)
        at org.apache.hadoop.hbase.regionserver.StoreFile.closeReader(StoreFile.java:503)
        - locked <0x00000004944ff2d8> (a org.apache.hadoop.hbase.regionserver.StoreFile)
        at org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:873)
        at org.apache.hadoop.hbase.regionserver.HStore$2.call(HStore.java:870)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

"StoreCloserThread-Wildfire_graph3,\x00\x04lK\x1B\xFC\x10\xD2,1402949830657.afb6a1720d936a83d73022aeb9ddbb6c.-1" prio=10 tid=0x0000000003508800 nid=0xad83 waiting on condition [0x00007fbcc5dcc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000534e90a80> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
        at org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:883)
        at org.apache.hadoop.hbase.regionserver.HStore.close(HStore.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1378)
        at org.apache.hadoop.hbase.regionserver.HRegion$2.call(HRegion.java:1375)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


"RS_CLOSE_REGION-hb20:60020-0" prio=10 tid=0x00007fcec0142000 nid=0x3056 waiting on condition [0x00007fbcc2d87000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000534e61360> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
        at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1385)
        at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1280)
        - locked <0x000000042230fa68> (a java.lang.Object)
        at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


I attached the whole thing as well.

-r


On 10/15/15, 10:39 AM, "Ted Yu" <yu...@gmail.com> wrote:

>Can you give a bit more detail on why block eviction was cause for the slow region movement?
>
>Did you happen to take stack traces ?
>
>Thanks
>
>> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
>> 
>> Hi,
>> 
>> We just upgraded from 0.94 to 1.0.0 and have noticed that region moves are super slow (order of minutes) whereas previously they where in the seconds range.  After looking at the code, I think the time is spent waiting for the blocks to be evicted from block cache.
>> 
>> I wanted to verify that this theory is correct and see if there is anything that can be done to speed up the moves.
>> 
>> This is particular painful as we are trying to get our configs tuned to the new SW and need to do rolling restarts which is taking almost 24 hours on our cluster.  We also do our own manual rebalancing of regions across RS’s and that task is also now painful.
>> 
>> 
>> Thanks,
>> 
>> Randy

Re: Slow region moves

Posted by Vladimir Rodionov <vl...@gmail.com>.

That is not 0.94 vs 1.0 issue - its BucketCache vs LRUCache issue.

It seems that, BucketCache.freeBlock is expensive (and source of
contention). It is the issue. Please open JIRA.

-Vlad

On Thu, Oct 15, 2015 at 11:29 AM, Randy Fox <rf...@connexity.com> wrote:

> Here is the region server log distilled down to events pertinent to the
> region move.  Not much being logged while the Store is being closed.
> This region was about 1.4G compressed, which I consider small.
>
> 2015-10-15 08:34:28,510 INFO
> org.apache.hadoop.hbase.regionserver.RSRpcServices: Close
> dad6c71ed395df19220ef1056a110086, moving to hb17.prod1.connexity.net
> ,60020,1444425924021
> 2015-10-15 08:34:28,510 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing
> close of
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:34:28,511 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.:
> disabling compactions & flushes
> 2015-10-15 08:34:28,511 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Running close preflush of
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:34:28,511 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Started memstore flush for
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.,
> current region memstore size 44.89 MB, and 1/1 column families' memstores
> are being flushed.
> 2015-10-15 08:34:28,511 WARN
> org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest
> seqNum for the region we are about to flush:
> [dad6c71ed395df19220ef1056a110086]
> 2015-10-15 08:34:29,137 INFO
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed,
> sequenceid=70839856845, memsize=44.9 M, hasBloomFilter=false, into tmp file
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/3600b47839a945de9733cf17581458e0
> 2015-10-15 08:34:29,144 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionFileSystem: Committing store
> file
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/3600b47839a945de9733cf17581458e0
> as
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/3600b47839a945de9733cf17581458e0
> 2015-10-15 08:34:29,150 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Added
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/3600b47839a945de9733cf17581458e0,
> entries=190131, sequenceid=70839856845, filesize=3.2 M
> 2015-10-15 08:34:29,151 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Finished memstore flush of ~44.89 MB/47066832, currentsize=22.39 KB/22928
> for region
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> in 640ms, sequenceid=70839856845, compaction requested=false
> 2015-10-15 08:34:29,152 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:34:29,152 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Started memstore flush for
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.,
> current region memstore size 22.39 KB, and 1/1 column families' memstores
> are being flushed.
> 2015-10-15 08:34:29,152 WARN
> org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest
> seqNum for the region we are about to flush:
> [dad6c71ed395df19220ef1056a110086]
> 2015-10-15 08:34:29,225 INFO
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed,
> sequenceid=70839856876, memsize=22.4 K, hasBloomFilter=false, into tmp file
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/a8a74549c7a44174b105cbed23cfaea1
> 2015-10-15 08:34:29,231 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionFileSystem: Committing store
> file
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/a8a74549c7a44174b105cbed23cfaea1
> as
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/a8a74549c7a44174b105cbed23cfaea1
> 2015-10-15 08:34:29,279 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Added
> hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/a8a74549c7a44174b105cbed23cfaea1,
> entries=128, sequenceid=70839856876, filesize=2.6 K
> 2015-10-15 08:34:29,280 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Finished memstore flush of ~22.39 KB/22928, currentsize=0 B/0 for region
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> in 128ms, sequenceid=70839856876, compaction requested=true
>
>
> 2015-10-15 08:37:02,945 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Closed L
> 2015-10-15 08:37:02,954 DEBUG org.apache.hadoop.hbase.wal.WALSplitter:
> Wrote region
> seqId=hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/recovered.edits/70839856879.seqid
> to file, newSeqId=70839856879, maxSeqId=70838935131
> 2015-10-15 08:37:02,955 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Closed
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:37:02,955 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Adding moved region
> record: dad6c71ed395df19220ef1056a110086 to hb17.prod1.connexity.net,60020,1444425924021
> as of 70838935131
> 2015-10-15 08:37:02,956 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x45049a3c1b80a8b, quorum=ccm2.prod1.connexity.net:2181
> ,ccm1.prod1.connexity.net:2181,db2.prod1.connexity.net:2181,
> sql2.prod1.connexity.net:2181,ccm3.prod1.connexity.net:2181,
> baseZNode=/hbase Transitioning dad6c71ed395df19220ef1056a110086 from
> M_ZK_REGION_CLOSING to RS_ZK_REGION_CLOSED
> 2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x45049a3c1b80a8b, quorum=ccm2.prod1.connexity.net:2181
> ,ccm1.prod1.connexity.net:2181,db2.prod1.connexity.net:2181,
> sql2.prod1.connexity.net:2181,ccm3.prod1.connexity.net:2181,
> baseZNode=/hbase Transitioned node dad6c71ed395df19220ef1056a110086 from
> M_ZK_REGION_CLOSING to RS_ZK_REGION_CLOSED
> 2015-10-15 08:37:02,967 DEBUG
> org.apache.hadoop.hbase.coordination.ZkCloseRegionCoordination: Set closed
> state in zk for
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> on hb18.prod1.connexity.net,60020,1444427168636
> 2015-10-15 08:37:02,967 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Closed
> Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> ~
>
>
>
>
>
>
> On 10/15/15, 11:05 AM, "Esteban Gutierrez" <es...@cloudera.com> wrote:
>
> >Hi Randy,
> >
> >As Ted suggested a couple of stack traces would be helpful but if you
> could
> >attach the HMaster and a couple of RS logs where you see this problem that
> >would be great. Also, do you have a large regions that might need some
> time
> >to flush? have you looked into any potential problem in HDFS?
> >
> >thanks,
> >esteban.
> >
> >--
> >Cloudera, Inc.
> >
> >
> >On Thu, Oct 15, 2015 at 10:39 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> Can you give a bit more detail on why block eviction was cause for the
> >> slow region movement?
> >>
> >> Did you happen to take stack traces ?
> >>
> >> Thanks
> >>
> >> > On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > We just upgraded from 0.94 to 1.0.0 and have noticed that region moves
> >> are super slow (order of minutes) whereas previously they where in the
> >> seconds range.  After looking at the code, I think the time is spent
> >> waiting for the blocks to be evicted from block cache.
> >> >
> >> > I wanted to verify that this theory is correct and see if there is
> >> anything that can be done to speed up the moves.
> >> >
> >> > This is particular painful as we are trying to get our configs tuned
> to
> >> the new SW and need to do rolling restarts which is taking almost 24
> hours
> >> on our cluster.  We also do our own manual rebalancing of regions across
> >> RS’s and that task is also now painful.
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Randy
> >>
>

Re: Slow region moves

Posted by Randy Fox <rf...@connexity.com>.

Here is the region server log distilled down to events pertinent to the region move.  Not much being logged while the Store is being closed.
This region was about 1.4G compressed, which I consider small.

2015-10-15 08:34:28,510 INFO org.apache.hadoop.hbase.regionserver.RSRpcServices: Close dad6c71ed395df19220ef1056a110086, moving to hb17.prod1.connexity.net,60020,1444425924021
2015-10-15 08:34:28,510 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
2015-10-15 08:34:28,511 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.: disabling compactions & flushes
2015-10-15 08:34:28,511 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
2015-10-15 08:34:28,511 INFO org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086., current region memstore size 44.89 MB, and 1/1 column families' memstores are being flushed.
2015-10-15 08:34:28,511 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [dad6c71ed395df19220ef1056a110086]
2015-10-15 08:34:29,137 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed, sequenceid=70839856845, memsize=44.9 M, hasBloomFilter=false, into tmp file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/3600b47839a945de9733cf17581458e0
2015-10-15 08:34:29,144 DEBUG org.apache.hadoop.hbase.regionserver.HRegionFileSystem: Committing store file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/3600b47839a945de9733cf17581458e0 as hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/3600b47839a945de9733cf17581458e0
2015-10-15 08:34:29,150 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/3600b47839a945de9733cf17581458e0, entries=190131, sequenceid=70839856845, filesize=3.2 M
2015-10-15 08:34:29,151 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~44.89 MB/47066832, currentsize=22.39 KB/22928 for region Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086. in 640ms, sequenceid=70839856845, compaction requested=false
2015-10-15 08:34:29,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
2015-10-15 08:34:29,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086., current region memstore size 22.39 KB, and 1/1 column families' memstores are being flushed.
2015-10-15 08:34:29,152 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [dad6c71ed395df19220ef1056a110086]
2015-10-15 08:34:29,225 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed, sequenceid=70839856876, memsize=22.4 K, hasBloomFilter=false, into tmp file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/a8a74549c7a44174b105cbed23cfaea1
2015-10-15 08:34:29,231 DEBUG org.apache.hadoop.hbase.regionserver.HRegionFileSystem: Committing store file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/a8a74549c7a44174b105cbed23cfaea1 as hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/a8a74549c7a44174b105cbed23cfaea1
2015-10-15 08:34:29,279 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/a8a74549c7a44174b105cbed23cfaea1, entries=128, sequenceid=70839856876, filesize=2.6 K
2015-10-15 08:34:29,280 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~22.39 KB/22928, currentsize=0 B/0 for region Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086. in 128ms, sequenceid=70839856876, compaction requested=true


2015-10-15 08:37:02,945 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed L
2015-10-15 08:37:02,954 DEBUG org.apache.hadoop.hbase.wal.WALSplitter: Wrote region seqId=hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/recovered.edits/70839856879.seqid to file, newSeqId=70839856879, maxSeqId=70838935131
2015-10-15 08:37:02,955 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
2015-10-15 08:37:02,955 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Adding moved region record: dad6c71ed395df19220ef1056a110086 to hb17.prod1.connexity.net,60020,1444425924021 as of 70838935131
2015-10-15 08:37:02,956 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x45049a3c1b80a8b, quorum=ccm2.prod1.connexity.net:2181,ccm1.prod1.connexity.net:2181,db2.prod1.connexity.net:2181,sql2.prod1.connexity.net:2181,ccm3.prod1.connexity.net:2181, baseZNode=/hbase Transitioning dad6c71ed395df19220ef1056a110086 from M_ZK_REGION_CLOSING to RS_ZK_REGION_CLOSED
2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x45049a3c1b80a8b, quorum=ccm2.prod1.connexity.net:2181,ccm1.prod1.connexity.net:2181,db2.prod1.connexity.net:2181,sql2.prod1.connexity.net:2181,ccm3.prod1.connexity.net:2181, baseZNode=/hbase Transitioned node dad6c71ed395df19220ef1056a110086 from M_ZK_REGION_CLOSING to RS_ZK_REGION_CLOSED
2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.coordination.ZkCloseRegionCoordination: Set closed state in zk for Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086. on hb18.prod1.connexity.net,60020,1444427168636
2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Closed Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
~                                                                                                                                                                                                   






On 10/15/15, 11:05 AM, "Esteban Gutierrez" <es...@cloudera.com> wrote:

>Hi Randy,
>
>As Ted suggested a couple of stack traces would be helpful but if you could
>attach the HMaster and a couple of RS logs where you see this problem that
>would be great. Also, do you have a large regions that might need some time
>to flush? have you looked into any potential problem in HDFS?
>
>thanks,
>esteban.
>
>--
>Cloudera, Inc.
>
>
>On Thu, Oct 15, 2015 at 10:39 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> Can you give a bit more detail on why block eviction was cause for the
>> slow region movement?
>>
>> Did you happen to take stack traces ?
>>
>> Thanks
>>
>> > On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
>> >
>> > Hi,
>> >
>> > We just upgraded from 0.94 to 1.0.0 and have noticed that region moves
>> are super slow (order of minutes) whereas previously they where in the
>> seconds range.  After looking at the code, I think the time is spent
>> waiting for the blocks to be evicted from block cache.
>> >
>> > I wanted to verify that this theory is correct and see if there is
>> anything that can be done to speed up the moves.
>> >
>> > This is particular painful as we are trying to get our configs tuned to
>> the new SW and need to do rolling restarts which is taking almost 24 hours
>> on our cluster.  We also do our own manual rebalancing of regions across
>> RS’s and that task is also now painful.
>> >
>> >
>> > Thanks,
>> >
>> > Randy
>>

Re: Slow region moves

Posted by Esteban Gutierrez <es...@cloudera.com>.

Hi Randy,

As Ted suggested a couple of stack traces would be helpful but if you could
attach the HMaster and a couple of RS logs where you see this problem that
would be great. Also, do you have a large regions that might need some time
to flush? have you looked into any potential problem in HDFS?

thanks,
esteban.

--
Cloudera, Inc.


On Thu, Oct 15, 2015 at 10:39 AM, Ted Yu <yu...@gmail.com> wrote:

> Can you give a bit more detail on why block eviction was cause for the
> slow region movement?
>
> Did you happen to take stack traces ?
>
> Thanks
>
> > On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
> >
> > Hi,
> >
> > We just upgraded from 0.94 to 1.0.0 and have noticed that region moves
> are super slow (order of minutes) whereas previously they where in the
> seconds range.  After looking at the code, I think the time is spent
> waiting for the blocks to be evicted from block cache.
> >
> > I wanted to verify that this theory is correct and see if there is
> anything that can be done to speed up the moves.
> >
> > This is particular painful as we are trying to get our configs tuned to
> the new SW and need to do rolling restarts which is taking almost 24 hours
> on our cluster.  We also do our own manual rebalancing of regions across
> RS’s and that task is also now painful.
> >
> >
> > Thanks,
> >
> > Randy
>

Re: Slow region moves

Posted by Ted Yu <yu...@gmail.com>.

Can you give a bit more detail on why block eviction was cause for the slow region movement?

Did you happen to take stack traces ?

Thanks

> On Oct 15, 2015, at 10:32 AM, Randy Fox <rf...@connexity.com> wrote:
> 
> Hi,
> 
> We just upgraded from 0.94 to 1.0.0 and have noticed that region moves are super slow (order of minutes) whereas previously they where in the seconds range.  After looking at the code, I think the time is spent waiting for the blocks to be evicted from block cache.
> 
> I wanted to verify that this theory is correct and see if there is anything that can be done to speed up the moves.
> 
> This is particular painful as we are trying to get our configs tuned to the new SW and need to do rolling restarts which is taking almost 24 hours on our cluster.  We also do our own manual rebalancing of regions across RS’s and that task is also now painful.
> 
> 
> Thanks,
> 
> Randy