Posted to user@hbase.apache.org by Brian Jeltema <br...@digitalenvoy.net> on 2014/07/21 14:03:44 UTC

snapshot timeout problem

I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing out after 60 seconds.
I increased the value of hbase.snapshot.master.timeoutMillis and restarted HBase,
but the timeout still happens after 60 seconds. Any suggestions?

Brian

Re: snapshot timeout problem

Posted by Ted Yu <yu...@gmail.com>.

You can leave your config value there.
Remember to record such changes somewhere for future reference; you may
change other cost parameters later.

The side effects of this change partly depend on how you want your
cluster balanced. I suggest you go over the CostFunction implementations in
StochasticLoadBalancer so that you know which factors (and their weights)
the load balancer considers.

Cheers



Re: snapshot timeout problem

Posted by Brian Jeltema <br...@digitalenvoy.net>.

That did the trick. I set it to 100 and regions are uniform now. Should I leave it there? What are the side-effects of this change?

Thanks.

Brian


Re: snapshot timeout problem

Posted by Ted Yu <yu...@gmail.com>.

Here is the code snippet from StochasticLoadBalancer
for TableSkewCostFunction:

    private static final String TABLE_SKEW_COST_KEY =
        "hbase.master.balancer.stochastic.tableSkewCost";
    private static final float DEFAULT_TABLE_SKEW_COST = 35;

    TableSkewCostFunction(Configuration conf) {
      super(conf);
      // The multiplier is the weight of this cost function in the balancer's total cost.
      this.setMultiplier(conf.getFloat(TABLE_SKEW_COST_KEY, DEFAULT_TABLE_SKEW_COST));
    }

You can try increasing the value for
"hbase.master.balancer.stochastic.tableSkewCost"

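As a rough illustration of the suggestion above, the override could go in hbase-site.xml on the master. This is only a sketch: the value 100 is the one Brian reports trying elsewhere in this thread (the default in the snippet above is 35), and the comment text is illustrative rather than taken from the HBase documentation.

```xml
<property>
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <!-- Illustrative value; the default shown in the source snippet is 35 -->
  <value>100</value>
</property>
```

Since the value is read in the TableSkewCostFunction constructor, a master restart is presumably needed before the new weight takes effect.
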

Cheers



Re: snapshot timeout problem

Posted by Brian Jeltema <br...@digitalenvoy.net>.

I don’t understand the logging output, but I do see a strange pattern. I’ll try to summarize.

There are 5 RegionServers, call them rs1 through rs5. There are a total of 174 regions for the table in question,
with 69 in rs1. In the log output I see lines (greatly simplified) like the following:

   AssignmentManager: Assigning fooTable, …. to rs2
   AssignmentManager: Assigning fooTable, …. to rs3
   AssignmentManager: Assigning fooTable, …. to rs4
   AssignmentManager: Assigning fooTable, …. to rs5

There are 106 such lines, none logging an assignment to rs1.

I also see 105 lines like:

  AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
  AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
  …

where src=rs1 in every case, and dest=rs1 never occurs.

I don’t see any exceptions or log output that reports a problem.



Re: snapshot timeout problem

Posted by Ted Yu <yu...@gmail.com>.

The load balancer in 0.98 considers many factors when making balancing decisions.

Can you take a look at the master log and look for balancer-related lines?
That should give you some clues.

Cheers


Re: snapshot timeout problem

Posted by Brian Jeltema <br...@digitalenvoy.net>.

I ran the balancer from hbase shell, but don’t see any change. Is there a way to balance a specific table?

Re: snapshot timeout problem

Posted by Ted Yu <yu...@gmail.com>.

bq. One RegionServer has 69 regions

Can you run the load balancer so that your regions are better balanced?

Cheers


On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <
brian.jeltema@digitalenvoy.net> wrote:

> There are 174 regions, not well balanced. One RegionServer has 69 regions.
> That RegionServer generates a series of log entries (modified and shown
> below), one for each region, at roughly 1 to 2 second intervals. The timeout
> period expires when it reaches region 36.
>
> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for hfiles
> 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references for [hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2] hfiles
> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file (1/1) : hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
> 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush Snapshotting region hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6. completed.
> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region operation on hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
> 2014-07-21 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1] snapshot.FlushSnapshotSubprocedure: Starting region operation on hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
> 2014-07-21 07:49:45,137 DEBUG [member: 'xxx.digitalenvoy.net,60020,1405943192177' subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager: Completed 1/174 local region snapshots.
> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729. started...
> 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for snapshot.
>
> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <je...@spaggiari.org>
> wrote:
>
> > Can you also tell us more about your table? How many regions on how many
> > region servers?
> >
> >
> > 2014-07-21 8:23 GMT-04:00 Ted Yu <yu...@gmail.com>:
> >
> >> Normally such a timeout is caused by one region server that is slow in
> >> completing its part of the snapshot procedure.
> >>
> >> Have you looked at the region server logs?
> >> Feel free to pastebin the relevant portion.
> >>
> >> Thanks
> >>
> >> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <brian.jeltema@digitalenvoy.net> wrote:
> >>
> >>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing out after 60 seconds.
> >>> I increased the value of hbase.snapshot.master.timeoutMillis and restarted HBase,
> >>> but the timeout still happens after 60 seconds. Any suggestions?
> >>>
> >>> Brian

Re: snapshot timeout problem

Posted by Ishan Chhabra <ic...@rocketfuel.com>.

The snapshot timeout properties are confusingly named and I dug through the
code to understand them some time ago. Use these:

  <property>
    <name>hbase.snapshot.master.timeoutMillis</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of tables -->
    <value>600000</value>
    <description>
      This is the time the HBase master waits for the snapshot operation to
      complete. Do not confuse this with hbase.snapshot.master.timeout.millis,
      which, although it sounds similar, serves a very different purpose.
      Note: This property had a completely different meaning before HBase
      version 0.94.11 and should not be enabled on a cluster that uses
      snapshots and runs a version before 0.94.11.
    </description>
  </property>
  <property>
    <name>hbase.snapshot.master.timeout.millis</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of tables -->
    <value>600000</value>
    <description>
      This is the timeout the master tells the client to wait when it takes
      the snapshot. The client actually waits longer than this due to
      exponential backoff. See HBaseAdmin.snapshot for the exact algorithm.
    </description>
  </property>
  <property>
    <name>hbase.snapshot.region.timeout</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of tables -->
    <value>600000</value>
    <description>
      This is the time the regionserver waits to complete all of its
      activities for a snapshot operation.
    </description>
  </property>


On Mon, Jul 21, 2014 at 7:02 AM, Matteo Bertozzi <th...@gmail.com>
wrote:

> There are two timeout properties. one on the region server side and the
> other one on master side (the coordinator).
>
> "hbase.snapshot.master.timeoutMillis"
> "hbase.snapshot.region.timeout"
>
> increasing the master side only has no effect since the region server side
> will send a timeout to the master after the default 60sec.
>
>
> Matteo
>
>
>
> On Mon, Jul 21, 2014 at 2:56 PM, Brian Jeltema <
> brian.jeltema@digitalenvoy.net> wrote:
>
> > There are 174 regions, not well balanced. One RegionServer has 69
> regions.
> > That RegionServer generates a
> > series of log entries (modified and shown below), one for each region, at
> > roughly 1 to 2 second intervals. The timeout period expires when
> > it reaches region 36.
> >
> > 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for
> > hfiles
> > 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references
> > for [hdfs://
> >
> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
> ]
> > hfiles
> > 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file
> > (1/1) : hdfs://
> >
> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
> > 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush
> > Snapshotting region
> > hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
> > completed.
> > 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing
> region
> > operation on
> >
> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.2014-07-21
> > 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net
> ,60020,1405943192177)-snapshot-pool3-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Starting region operation on
> > hosts,\x00\x8A\x90\xD6\x08,1400
> > 659179080.a74402fcbd9a96a7c92b250721095729.2014-07-21 07:49:45,137 DEBUG
> > [member: ‘xxx.digitalenvoy.net,60020,1405943192177'
> > subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager:
> > Completed 1/174 local region snapshots.
> > 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush
> > Snapshotting region
> >
> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
> > started...
> > 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for
> > snapshot.
> >
> > On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org>
> > wrote:
> >
> > > Can you also tell us more about your table? How many regions on how
> many
> > > region servers?
> > >
> > >
> > > 2014-07-21 8:23 GMT-04:00 Ted Yu <yu...@gmail.com>:
> > >
> > >> Normally such timeout is caused by one region server which is slow in
> > >> completing its part of the snapshot procedure.
> > >>
> > >> Have you looked at region server logs ?
> > >> Feel free to pastebin relevant portion.
> > >>
> > >> Thanks
> > >>
> > >> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <
> > brian.jeltema@digitalenvoy.net>
> > >> wrote:
> > >>
> > >>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s
> timing
> > >> out after 60 seconds.
> > >>> I increased the value of hbase.snapshot.master.timeoutMillis and
> > >> restarted HBase,
> > >>> but the timeout still happens after 60 seconds. Any suggestions?
> > >>>
> > >>> Brian
> > >>
> >
> >
>



-- 
*Ishan Chhabra *| Rocket Scientist | RocketFuel Inc.

Re: snapshot timeout problem

Posted by Matteo Bertozzi <th...@gmail.com>.

There are two timeout properties: one on the region server side and the
other on the master side (the coordinator).

"hbase.snapshot.master.timeoutMillis"
"hbase.snapshot.region.timeout"

Increasing only the master-side value has no effect, since the region server
side will send a timeout to the master after the default 60 seconds.
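
To illustrate the point above, here is a minimal Java sketch of a simplified
model. It is not HBase code: it only assumes that the snapshot fails as soon
as either side's timeout elapses, so the effective limit is the smaller of the
two values. The class and method names are hypothetical.

```java
// Simplified model of the two snapshot timeouts described above.
// This is NOT HBase code; the class and method names are hypothetical.
class SnapshotTimeoutModel {
    // Default of 60 seconds mentioned in this thread, in milliseconds.
    static final long DEFAULT_TIMEOUT_MILLIS = 60_000L;

    // Assumption: whichever side gives up first ends the snapshot.
    static long effectiveTimeoutMillis(long masterMillis, long regionMillis) {
        return Math.min(masterMillis, regionMillis);
    }

    public static void main(String[] args) {
        // Only the master-side value raised to 600 s.
        System.out.println(effectiveTimeoutMillis(600_000L, DEFAULT_TIMEOUT_MILLIS));
        // Both values raised to 600 s.
        System.out.println(effectiveTimeoutMillis(600_000L, 600_000L));
    }
}
```

Under this model, raising only the master-side value leaves the effective
limit at the 60-second default, which matches the behavior reported in this
thread.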


Matteo



On Mon, Jul 21, 2014 at 2:56 PM, Brian Jeltema <
brian.jeltema@digitalenvoy.net> wrote:

> There are 174 regions, not well balanced. One RegionServer has 69 regions.
> That RegionServer generates a
> series of log entries (modified and shown below), one for each region, at
> roughly 1 to 2 second intervals. The timeout period expires when
> it reaches region 36.
>
> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for
> hfiles
> 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references
> for [hdfs://
> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2]
> hfiles
> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file
> (1/1) : hdfs://
> xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
> 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush
> Snapshotting region
> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
> completed.
> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region
> operation on
> hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.2014-07-21
> 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1]
> snapshot.FlushSnapshotSubprocedure: Starting region operation on
> hosts,\x00\x8A\x90\xD6\x08,1400
> 659179080.a74402fcbd9a96a7c92b250721095729.2014-07-21 07:49:45,137 DEBUG
> [member: ‘xxx.digitalenvoy.net,60020,1405943192177'
> subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager:
> Completed 1/174 local region snapshots.
> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush
> Snapshotting region
> hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
> started...
> 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for
> snapshot.
>
> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <je...@spaggiari.org>
> wrote:
>
> > Can you also tell us more about your table? How many regions on how many
> > region servers?
> >
> >
> > 2014-07-21 8:23 GMT-04:00 Ted Yu <yu...@gmail.com>:
> >
> >> Normally such timeout is caused by one region server which is slow in
> >> completing its part of the snapshot procedure.
> >>
> >> Have you looked at region server logs ?
> >> Feel free to pastebin relevant portion.
> >>
> >> Thanks
> >>
> >> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <
> brian.jeltema@digitalenvoy.net>
> >> wrote:
> >>
> >>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing
> >> out after 60 seconds.
> >>> I increased the value of hbase.snapshot.master.timeoutMillis and
> >> restarted HBase,
> >>> but the timeout still happens after 60 seconds. Any suggestions?
> >>>
> >>> Brian
> >>
>
>

Re: snapshot timeout problem

Posted by Brian Jeltema <br...@digitalenvoy.net>.

There are 174 regions, and they are not well balanced: one RegionServer has 69 of them. That RegionServer generates a
series of log entries (modified and shown below), one for each region, at roughly 1 to 2 second intervals. The timeout period expires when
it reaches region 36.

2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for hfiles
2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references for [hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2] hfiles
2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file (1/1) : hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush Snapshotting region hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6. completed.
2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region operation on hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
2014-07-21 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1] snapshot.FlushSnapshotSubprocedure: Starting region operation on hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
2014-07-21 07:49:45,137 DEBUG [member: 'xxx.digitalenvoy.net,60020,1405943192177' subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager: Completed 1/174 local region snapshots.
2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729. started...
2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for snapshot.
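
As a rough back-of-the-envelope check of the numbers above, here is a small
Java sketch. It assumes the regions on one RegionServer are processed one
after another at a constant pace, which the log only suggests. The class and
method names are made up for illustration.

```java
// Rough estimate of how long the busiest RegionServer needs for its share
// of the snapshot. Illustrative only; the names are hypothetical.
class SnapshotDurationEstimate {
    // Seconds needed for `regions` regions at `secondsPerRegion` each,
    // assuming they are handled sequentially.
    static double estimateSeconds(int regions, double secondsPerRegion) {
        return regions * secondsPerRegion;
    }

    public static void main(String[] args) {
        int regions = 69; // regions on the busiest RegionServer
        double low = estimateSeconds(regions, 1.0);  // 1 s per region
        double high = estimateSeconds(regions, 2.0); // 2 s per region
        // Pace observed here: region 36 was reached after about 60 s.
        double observed = estimateSeconds(regions, 60.0 / 36.0);
        System.out.printf("low=%.0f s, high=%.0f s, observed pace=%.0f s%n",
                low, high, observed);
    }
}
```

Under these assumptions the RegionServer needs somewhere between 69 and 138
seconds, so a 60-second limit is too short for it.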

On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:

> Can you also tell us more about your table? How many regions on how many
> region servers?
> 
> 
> 2014-07-21 8:23 GMT-04:00 Ted Yu <yu...@gmail.com>:
> 
>> Normally such timeout is caused by one region server which is slow in
>> completing its part of the snapshot procedure.
>> 
>> Have you looked at region server logs ?
>> Feel free to pastebin relevant portion.
>> 
>> Thanks
>> 
>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <br...@digitalenvoy.net>
>> wrote:
>> 
>>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing
>> out after 60 seconds.
>>> I increased the value of hbase.snapshot.master.timeoutMillis and
>> restarted HBase,
>>> but the timeout still happens after 60 seconds. Any suggestions?
>>> 
>>> Brian
>>

Re: snapshot timeout problem

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Can you also tell us more about your table? How many regions on how many
region servers?


2014-07-21 8:23 GMT-04:00 Ted Yu <yu...@gmail.com>:

> Normally such timeout is caused by one region server which is slow in
> completing its part of the snapshot procedure.
>
> Have you looked at region server logs ?
> Feel free to pastebin relevant portion.
>
> Thanks
>
> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <br...@digitalenvoy.net>
> wrote:
>
> > I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing
> out after 60 seconds.
> > I increased the value of hbase.snapshot.master.timeoutMillis and
> restarted HBase,
> > but the timeout still happens after 60 seconds. Any suggestions?
> >
> > Brian
>

Re: snapshot timeout problem

Posted by Ted Yu <yu...@gmail.com>.

Normally such a timeout is caused by one region server that is slow in completing its part of the snapshot procedure.

Have you looked at the region server logs?
Feel free to pastebin the relevant portion.

Thanks

On Jul 21, 2014, at 4:03 AM, Brian Jeltema <br...@digitalenvoy.net> wrote:

> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing out after 60 seconds.
> I increased the value of hbase.snapshot.master.timeoutMillis and restarted HBase,
> but the timeout still happens after 60 seconds. Any suggestions?
> 
> Brian

Re: snapshot timeout problem

Posted by sudhakara st <su...@gmail.com>.

Hello Brian,

Timeouts can occur for various reasons. Can you post the full stack trace?




On Mon, Jul 21, 2014 at 5:33 PM, Brian Jeltema <
brian.jeltema@digitalenvoy.net> wrote:

> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing
> out after 60 seconds.
> I increased the value of hbase.snapshot.master.timeoutMillis and restarted
> HBase,
> but the timeout still happens after 60 seconds. Any suggestions?
>
> Brian




-- 

Regards,
...sudhakara