Posted to user@hbase.apache.org by Dan Harvey <da...@mendeley.com> on 2010/06/01 11:57:25 UTC

Re: Missing Split (full message)

Thanks,

I had a look into this and running :-

  ./hbase org.jruby.Main add_table.rb hdfs://<server>/hbase/<table>

fixed the problem with the meta in this case for us.
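
For anyone who hits the same hole, the whole sequence we ended up running looks roughly like this. It's a sketch only: the namenode host/port is a placeholder for your own setup, the table name is ours, and it assumes the stock 0.20 hbase shell plus the add_table.rb script from bin/ (disabling the table first, as J-D suggests below):

  # take the table offline before rewriting its .META. rows
  echo "disable 'source_documents'" | ./hbase shell

  # rebuild the missing .META. entries from the region dirs still on HDFS
  ./hbase org.jruby.Main add_table.rb hdfs://namenode:9000/hbase/source_documents

  # bring it back and eyeball the region boundaries in .META.
  echo "enable 'source_documents'" | ./hbase shell
  echo "scan '.META.', {COLUMNS => 'info:regioninfo'}" | ./hbase shell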

In what cases would a datanode failure (for example, running out of
memory in our case) cause HBase data loss?
Would it mostly only cause data loss to the meta regions, or does it
also cause problems with the actual region files?

On 25 May 2010 18:47, Jean-Daniel Cryans <jd...@apache.org> wrote:
> The edits to .META. were likely lost, so scanning .META. won't solve
> the issue (although it could be smarter and figure that there's a
> hole, find the missing region on HDFS, and add it back).
>
> So your region is probably still physically on HDFS. See the
> bin/add_table.rb script, which will help you get that line back into
> .META.; do disable your table before running it. Search the archives
> of this mailing list for others who had the same issue if something
> doesn't seem clear.
>
> I'd also like to point out that those edits were lost because HDFS
> won't support fsync until 0.21, so data loss is likely in the face of
> machine and process failure.
>
> J-D
>
> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey <da...@mendeley.com> wrote:
>> Hi,
>>
>> Sorry for the multiple e-mails, it seems gmail didn't send my whole
>> message last time! Anyway here it goes again...
>>
>> Whilst loading data via a mapreduce job into HBase I have started getting
>> this error :-
>>
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>> contact region server Some server, retryOnlyOne=true, index=0,
>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
>> region=source_documents,ipubmed\x219915054,1274525958679 for region
>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
>> but failed after 10 attempts.
>> Exceptions:
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>
>> In the master there are the following three regions :-
>>
>> source_documents,ipubmed\x219859228,1274701893687   hadoop1   1825870642   ipubmed\x219859228   ipubmed\x219915054
>> source_documents,ipubmed\x219915054,1274525958679   hadoop4   193393334    ipubmed\x219915054   u102193588
>> source_documents,u102193588,1274486550122           hadoop4   2141795358   u102193588           u105043522
>>
>> and on one of our 5 nodes I found a region which starts with
>>
>> ipubmed\x219915054 and ends with u102002564
>>
>> and on another I found the other half of the split which starts with
>>
>> u102002564 and ends with u102193588
>>
>> So it seems the middle region listed by the master was split, but the
>> split never made it back to the master.
>>
>> We've had a few problems over the last few days with hdfs nodes
>> failing due to lack of memory; that has now been fixed, but it could
>> have been a cause of this problem.
>>
>> In what ways can a split fail to be received by the master, and how
>> long would it take for hbase to fix this? I've read that it
>> periodically scans the META table to find problems like this, but not
>> how often. It has been about 12h here and our cluster doesn't appear
>> to have fixed this missing split. Is there a way to force the master
>> to rescan the META table? Will it fix problems like this given time?
>>
>> Thanks,
>>
>> --
>> Dan Harvey | Datamining Engineer
>> www.mendeley.com/profiles/dan-harvey
>>
>> Mendeley Limited | London, UK | www.mendeley.com
>> Registered in England and Wales | Company Number 6419015
>>
>

-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

RE: Missing Split (full message)

Posted by Jonathan Gray <jg...@facebook.com>.
FYI, the 0.20 append branch has now been created.  Patches will be trickling in over the next week.

http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/

or the actual svn repo:

https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/
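
If you want to follow the patches locally, a checkout is all it takes (the target directory name below is just an example):

  svn checkout https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ branch-0.20-append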

The list of jiras for this branch:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=12310942&fixfor=12315103&resolution=-1&sorter/field=priority&sorter/order=DESC


JG

> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Wednesday, June 02, 2010 8:26 AM
> To: user@hbase.apache.org
> Subject: Re: Missing Split (full message)
> 
> On Wed, Jun 2, 2010 at 2:13 AM, Dan Harvey <da...@mendeley.com>
> wrote:
> > Yes we're running Cloudera CDH2 which I've just checked includes a
> > back ported hdfs-630 patch.
> >
> > I guess a lot of these issues will be gone once hadoop 0.21 is out
> and
> > hbase can take advantage of the new features.
> >
> 
> Thats the hope.  A bunch of fixes have gone in for hbase provoked hdfs
> issues in 0.20.  Look out for the append branch in hdfs 0.20 coming
> soon (It'll be here:
> http://svn.apache.org/viewvc/hadoop/common/branches/).  It'll be a
> 0.20 branch with support for append (hdfs-200, hdfs-142, etc.) and
> other fixes needed by hbase.  Thats what the next major hbase will
> ship against (CDH3 will include this stuff and then some, if I
> understand Todd+crew's plans correctly).
> 
> Good on you Dan,
> St.Ack
> 
> 
> > Thanks,
> >
> > On 2 June 2010 01:10, Stack <st...@duboce.net> wrote:
> >> Hey Dan:
> >>
> >> On Tue, Jun 1, 2010 at 2:57 AM, Dan Harvey <da...@mendeley.com>
> wrote:
> >>> In what cases would a datanode failure (for example running out of
> >>> memory in ourcase) cause HBase data loss?
> >>
> >> We should just move past the damaged DN on to the other replicas but
> >> there are probably places where we can get hungup.  Out of interest
> >> are you running with hdfs-630 inplace?
> >>
> >>> Would it mostly only causes dataloss to the meta regions or does it
> >>> also cause problems with the actual region files?
> >>>
> >>
> >> HDFS files that had their blocks located on the damaged DN would be
> >> susceptible (meta files are just like any other).
> >>
> >> St.Ack
> >>
> >>>> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey
> <da...@mendeley.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Sorry for the multiple e-mails, it seems gmail didn't send my
> whole
> >>>>> message last time! Anyway here it goes again...
> >>>>>
> >>>>> Whilst loading data via a mapreduce job into HBase I have started
> getting
> >>>>> this error :-
> >>>>>
> >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> >>>>> contact region server Some server, retryOnlyOne=true, index=0,
> >>>>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
> >>>>> region=source_documents,ipubmed\x219915054,1274525958679 for region
> >>>>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
> >>>>> but failed after 10 attempts.
> >>>>> Exceptions:
> >>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
> >>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
> >>>>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
> >>>>>
> >>>>> In the master there are the following three regions :-
> >>>>>
> >>>>> source_documents,ipubmed\x219859228,1274701893687   hadoop1   1825870642   ipubmed\x219859228   ipubmed\x219915054
> >>>>> source_documents,ipubmed\x219915054,1274525958679   hadoop4   193393334    ipubmed\x219915054   u102193588
> >>>>> source_documents,u102193588,1274486550122           hadoop4   2141795358   u102193588           u105043522
> >>>>>
> >>>>> and on one of our 5 nodes I found a region which start with
> >>>>>
> >>>>> ipubmed\x219915054 and ends with u102002564
> >>>>>
> >>>>> and on another I found the other half of the split which starts
> with
> >>>>>
> >>>>> u102002564 and ends with u102193588
> >>>>>
> >>>>> So it seems that the middle region on the master was split apart
> but
> >>>>> that failed to reach the master.
> >>>>>
> >>>>> We've had a few problems over the last few days with hdfs nodes
> >>>>> failing due to lack of memory which has now been fixed but could
> have
> >>>>> been a cause of this problem.
> >>>>>
> >>>>> What ways can a split fail to be received by the master and how
> long
> >>>>> would it take for hbase to fix this? I've read it periodically
> will
> >>>>> scan the META table to find problems like this but didn't say how
> >>>>> often? It has been about 12h here and our cluster didn't appear
> to
> >>>>> have fixed this missing split, is there a way to force the master
> to
> >>>>> rescan the META table? Will it fix problems like this given time?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> --
> >>>>> Dan Harvey | Datamining Engineer
> >>>>> www.mendeley.com/profiles/dan-harvey
> >>>>>
> >>>>> Mendeley Limited | London, UK | www.mendeley.com
> >>>>> Registered in England and Wales | Company Number 6419015
> >>>>>
> >>>>
> >>>
> >>> --
> >>> Dan Harvey | Datamining Engineer
> >>> www.mendeley.com/profiles/dan-harvey
> >>>
> >>> Mendeley Limited | London, UK | www.mendeley.com
> >>> Registered in England and Wales | Company Number 6419015
> >>>
> >>
> >
> > --
> > Dan Harvey | Datamining Engineer
> > www.mendeley.com/profiles/dan-harvey
> >
> > Mendeley Limited | London, UK | www.mendeley.com
> > Registered in England and Wales | Company Number 6419015
> >

Re: Missing Split (full message)

Posted by Stack <st...@duboce.net>.
On Wed, Jun 2, 2010 at 2:13 AM, Dan Harvey <da...@mendeley.com> wrote:
> Yes we're running Cloudera CDH2 which I've just checked includes a
> back ported hdfs-630 patch.
>
> I guess a lot of these issues will be gone once hadoop 0.21 is out and
> hbase can take advantage of the new features.
>

That's the hope.  A bunch of fixes have gone in for hbase-provoked hdfs
issues in 0.20.  Look out for the append branch in hdfs 0.20 coming
soon (it'll be here:
http://svn.apache.org/viewvc/hadoop/common/branches/).  It'll be a
0.20 branch with support for append (hdfs-200, hdfs-142, etc.) and
other fixes needed by hbase.  That's what the next major hbase will
ship against (CDH3 will include this stuff and then some, if I
understand Todd+crew's plans correctly).

Good on you Dan,
St.Ack


> Thanks,
>
> On 2 June 2010 01:10, Stack <st...@duboce.net> wrote:
>> Hey Dan:
>>
>> On Tue, Jun 1, 2010 at 2:57 AM, Dan Harvey <da...@mendeley.com> wrote:
>>> In what cases would a datanode failure (for example running out of
>>> memory in ourcase) cause HBase data loss?
>>
>> We should just move past the damaged DN on to the other replicas but
>> there are probably places where we can get hungup.  Out of interest
>> are you running with hdfs-630 inplace?
>>
>>> Would it mostly only causes dataloss to the meta regions or does it
>>> also cause problems with the actual region files?
>>>
>>
>> HDFS files that had their blocks located on the damaged DN would be
>> susceptible (meta files are just like any other).
>>
>> St.Ack
>>
>>>> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey <da...@mendeley.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Sorry for the multiple e-mails, it seems gmail didn't send my whole
>>>>> message last time! Anyway here it goes again...
>>>>>
>>>>> Whilst loading data via a mapreduce job into HBase I have started getting
>>>>> this error :-
>>>>>
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>>>> contact region server Some server, retryOnlyOne=true, index=0,
>>>>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
>>>>> region=source_documents,ipubmed\x219915054,1274525958679 for region
>>>>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
>>>>> but failed after 10 attempts.
>>>>> Exceptions:
>>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>>>
>>>>> In the master there are the following three regions :-
>>>>>
>>>>> source_documents,ipubmed\x219859228,1274701893687       hadoop1
>>>>> 1825870642      ipubmed\x219859228      ipubmed\x219915054
>>>>> source_documents,ipubmed\x219915054,1274525958679       hadoop4
>>>>> 193393334        ipubmed\x219915054      u102193588
>>>>> source_documents,u102193588,1274486550122                    hadoop4
>>>>> 2141795358      u102193588                    u105043522
>>>>>
>>>>> and on one of our 5 nodes I found a region which start with
>>>>>
>>>>> ipubmed\x219915054 and ends with u102002564
>>>>>
>>>>> and on another I found the other half of the split which starts with
>>>>>
>>>>> u102002564 and ends with u102193588
>>>>>
>>>>> So it seems that the middle region on the master was split apart but
>>>>> that failed to reach the master.
>>>>>
>>>>> We've had a few problems over the last few days with hdfs nodes
>>>>> failing due to lack of memory which has now been fixed but could have
>>>>> been a cause of this problem.
>>>>>
>>>>> What ways can a split fail to be received by the master and how long
>>>>> would it take for hbase to fix this? I've read it periodically will
>>>>> scan the META table to find problems like this but didn't say how
>>>>> often? It has been about 12h here and our cluster didn't appear to
>>>>> have fixed this missing split, is there a way to force the master to
>>>>> rescan the META table? Will it fix problems like this given time?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Dan Harvey | Datamining Engineer
>>>>> www.mendeley.com/profiles/dan-harvey
>>>>>
>>>>> Mendeley Limited | London, UK | www.mendeley.com
>>>>> Registered in England and Wales | Company Number 6419015
>>>>>
>>>>
>>>
>>> --
>>> Dan Harvey | Datamining Engineer
>>> www.mendeley.com/profiles/dan-harvey
>>>
>>> Mendeley Limited | London, UK | www.mendeley.com
>>> Registered in England and Wales | Company Number 6419015
>>>
>>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>

Re: Missing Split (full message)

Posted by Dan Harvey <da...@mendeley.com>.
Yes, we're running Cloudera CDH2, which I've just checked includes a
back-ported hdfs-630 patch.

I guess a lot of these issues will be gone once hadoop 0.21 is out and
hbase can take advantage of the new features.

Thanks,

On 2 June 2010 01:10, Stack <st...@duboce.net> wrote:
> Hey Dan:
>
> On Tue, Jun 1, 2010 at 2:57 AM, Dan Harvey <da...@mendeley.com> wrote:
>> In what cases would a datanode failure (for example running out of
>> memory in ourcase) cause HBase data loss?
>
> We should just move past the damaged DN on to the other replicas but
> there are probably places where we can get hungup.  Out of interest
> are you running with hdfs-630 inplace?
>
>> Would it mostly only causes dataloss to the meta regions or does it
>> also cause problems with the actual region files?
>>
>
> HDFS files that had their blocks located on the damaged DN would be
> susceptible (meta files are just like any other).
>
> St.Ack
>
>>> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey <da...@mendeley.com> wrote:
>>>> Hi,
>>>>
>>>> Sorry for the multiple e-mails, it seems gmail didn't send my whole
>>>> message last time! Anyway here it goes again...
>>>>
>>>> Whilst loading data via a mapreduce job into HBase I have started getting
>>>> this error :-
>>>>
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>>> contact region server Some server, retryOnlyOne=true, index=0,
>>>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
>>>> region=source_documents,ipubmed\x219915054,1274525958679 for region
>>>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
>>>> but failed after 10 attempts.
>>>> Exceptions:
>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>>
>>>> In the master there are the following three regions :-
>>>>
>>>> source_documents,ipubmed\x219859228,1274701893687       hadoop1
>>>> 1825870642      ipubmed\x219859228      ipubmed\x219915054
>>>> source_documents,ipubmed\x219915054,1274525958679       hadoop4
>>>> 193393334        ipubmed\x219915054      u102193588
>>>> source_documents,u102193588,1274486550122                    hadoop4
>>>> 2141795358      u102193588                    u105043522
>>>>
>>>> and on one of our 5 nodes I found a region which start with
>>>>
>>>> ipubmed\x219915054 and ends with u102002564
>>>>
>>>> and on another I found the other half of the split which starts with
>>>>
>>>> u102002564 and ends with u102193588
>>>>
>>>> So it seems that the middle region on the master was split apart but
>>>> that failed to reach the master.
>>>>
>>>> We've had a few problems over the last few days with hdfs nodes
>>>> failing due to lack of memory which has now been fixed but could have
>>>> been a cause of this problem.
>>>>
>>>> What ways can a split fail to be received by the master and how long
>>>> would it take for hbase to fix this? I've read it periodically will
>>>> scan the META table to find problems like this but didn't say how
>>>> often? It has been about 12h here and our cluster didn't appear to
>>>> have fixed this missing split, is there a way to force the master to
>>>> rescan the META table? Will it fix problems like this given time?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Dan Harvey | Datamining Engineer
>>>> www.mendeley.com/profiles/dan-harvey
>>>>
>>>> Mendeley Limited | London, UK | www.mendeley.com
>>>> Registered in England and Wales | Company Number 6419015
>>>>
>>>
>>
>> --
>> Dan Harvey | Datamining Engineer
>> www.mendeley.com/profiles/dan-harvey
>>
>> Mendeley Limited | London, UK | www.mendeley.com
>> Registered in England and Wales | Company Number 6419015
>>
>

-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

Re: Missing Split (full message)

Posted by Stack <st...@duboce.net>.
Hey Dan:

On Tue, Jun 1, 2010 at 2:57 AM, Dan Harvey <da...@mendeley.com> wrote:
> In what cases would a datanode failure (for example running out of
> memory in ourcase) cause HBase data loss?

We should just move past the damaged DN on to the other replicas, but
there are probably places where we can get hung up.  Out of interest,
are you running with hdfs-630 in place?

> Would it mostly only causes dataloss to the meta regions or does it
> also cause problems with the actual region files?
>

HDFS files that had their blocks located on the damaged DN would be
susceptible (meta files are just like any other).
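
For what it's worth, a quick way to see which HBase files had blocks on the bad DN is fsck over the hbase root dir. A sketch, assuming a stock 0.20 hadoop install and the default /hbase root:

  # list every file under /hbase with its blocks and the datanodes holding them
  hadoop fsck /hbase -files -blocks -locations

  # or just the overall health summary (missing / under-replicated blocks)
  hadoop fsck /hbase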

St.Ack

>> On Mon, May 24, 2010 at 2:39 PM, Dan Harvey <da...@mendeley.com> wrote:
>>> Hi,
>>>
>>> Sorry for the multiple e-mails, it seems gmail didn't send my whole
>>> message last time! Anyway here it goes again...
>>>
>>> Whilst loading data via a mapreduce job into HBase I have started getting
>>> this error :-
>>>
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> contact region server Some server, retryOnlyOne=true, index=0,
>>> islastrow=false, tries=9, numtries=10, i=0, listsize=19,
>>> region=source_documents,ipubmed\x219915054,1274525958679 for region
>>> source_documents,ipubmed\x219915054,1274525958679, row 'u1012913162',
>>> but failed after 10 attempts.
>>> Exceptions:
>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>
>>> In the master there are the following three regions :-
>>>
>>> source_documents,ipubmed\x219859228,1274701893687       hadoop1
>>> 1825870642      ipubmed\x219859228      ipubmed\x219915054
>>> source_documents,ipubmed\x219915054,1274525958679       hadoop4
>>> 193393334        ipubmed\x219915054      u102193588
>>> source_documents,u102193588,1274486550122                    hadoop4
>>> 2141795358      u102193588                    u105043522
>>>
>>> and on one of our 5 nodes I found a region which start with
>>>
>>> ipubmed\x219915054 and ends with u102002564
>>>
>>> and on another I found the other half of the split which starts with
>>>
>>> u102002564 and ends with u102193588
>>>
>>> So it seems that the middle region on the master was split apart but
>>> that failed to reach the master.
>>>
>>> We've had a few problems over the last few days with hdfs nodes
>>> failing due to lack of memory which has now been fixed but could have
>>> been a cause of this problem.
>>>
>>> What ways can a split fail to be received by the master and how long
>>> would it take for hbase to fix this? I've read it periodically will
>>> scan the META table to find problems like this but didn't say how
>>> often? It has been about 12h here and our cluster didn't appear to
>>> have fixed this missing split, is there a way to force the master to
>>> rescan the META table? Will it fix problems like this given time?
>>>
>>> Thanks,
>>>
>>> --
>>> Dan Harvey | Datamining Engineer
>>> www.mendeley.com/profiles/dan-harvey
>>>
>>> Mendeley Limited | London, UK | www.mendeley.com
>>> Registered in England and Wales | Company Number 6419015
>>>
>>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>