Posted to user@hbase.apache.org by David Koch <og...@googlemail.com> on 2013/11/22 16:14:12 UTC

One Region Server fails - all M/R jobs crash.

Hello,

We experience reliability problems when running M/R jobs over HBase tables.
Specifically, a single Region Server crash is enough to fail every running
M/R job.

My guess is that this is not normal with a replication factor of 3.

The HBase version is 0.94.6, installed as part of Cloudera 4.4. HBase
settings are the defaults. Cluster size is 30 machines.

What steps can I follow to improve the situation?

Thank you,

/David

Re: One Region Server fails - all M/R jobs crash.

Posted by Asaf Mesika <as...@gmail.com>.
Come to think of it, I think it makes perfect sense for HBase to protect
itself against OOMEs emanating from user queries. Imagine if MySQL crashed
whenever there were too many queries, or one giant query.



On Fri, Nov 22, 2013 at 10:06 PM, Dhaval Shah
<pr...@yahoo.co.in>wrote:

> How big can your rows get? If you have a million columns on a row, you
> might run your region server out of memory. Can you try setBatch to a
> smaller number and test if that works?
>
> 10k regions is too many Can you try and increase your max file size and
> see if that helps.
>
> 8 cores / 1 disk is a bad combination. Can you look at disk IO during the
> time of crash and see if you find anything there.
>
> You might also be swapping. Can you look at your GC logs?
>
> You are running dangerously close to the fence with the kind of hardware
> you have.
>
> Regards,
> Dhaval
>
>
> ________________________________
>  From: David Koch <og...@googlemail.com>
> To: user@hbase.apache.org
> Sent: Friday, 22 November 2013 2:43 PM
> Subject: Re: One Region Server fails - all M/R jobs crash.
>
>
> Hello,
>
> Thank you for your replies.
>
> Not that it matters but, cache is 1, batch is -1 on the scan i.e each RPC
> call returns one row. The jobs don't write any data back to HBase,
> compaction is de-activated and done manually. At the time of the crash all
> datanodes were fine, hbchk showed no inconsistencies. Table size is about
> 10k regions/3 billion records on the largest tables and we do a lot of
> server side filtering to limit what's sent across the network.
>
> Our machines may not be the most powerful, 32GB RAM, 8 cores, 1 disk. It's
> also true that when we took a closer look in the past it turned out that
> most of the issues we had were somehow rooted in the fact that CPUs were
> overloaded, not enough memory available - hardware stuff.
>
> What I don't get is why HBase always crashes. I mean if it's slow ok - the
> hardware is a bottleneck but at least you'd expect it to pull through
> eventually. Some days all jobs work fine, some days they don't and there is
> no telling why. HBase's erratic behavior has been causing us a lot of
> headache and we have been spending way too much time fiddling with HBase
> configuration settings over the past 18 months.
>
> /David
>
>
>
> On Fri, Nov 22, 2013 at 7:05 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Thanks Dhaval for the analysis.
> >
> > bq. The HBase version is 0.94.6
> >
> > David:
> > Please upgrade to 0.94.13, if possible. There have been several JIRAs
> > backporting patches from trunk where jdk 1.7 is supported.
> >
> > Please also check your DataNode log to see whether there was problem
> there
> > (likely there was).
> >
> > Cheers
> >
> >
> > On Sat, Nov 23, 2013 at 2:00 AM, Dhaval Shah <
> prince_mithibai@yahoo.co.in
> > >wrote:
> >
> > > You logs suggest that you are overloading resources
> > > (servers/network/memory). How much data are you scanning with your MR
> > job,
> > > how much are you writing back to HBase? What values are you setting for
> > > setBatch, setCaching, setCacheBlocks? How much memory do you have on
> your
> > > region servers? 1 server crashing should not cause a job to fail
> because
> > it
> > > will move on to the next one (given the right parmas for retries and
> > retry
> > > interval are set). Your region server logs suggest that its way more
> > > complicated than that.
> > >
> > > 2013-11-17 09:58:37,513 WARN
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > > region we are already opening or closing;
> > e54b8e16ffbe2187b9017fef596c62aa
> > >
> > > looks like some state inconsistency issue
> > >
> > > I also see that you are using Java 7. Though some people have had
> success
> > > using it, I am not sure if Java 7 is currently the recommended version
> > > (most people use Java 6!)
> > >
> > > 2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to
> > > read additional data from server sessionid 0x342654dfdd30017, likely
> > server
> > > has closed socket, closing socket connection and attempting reconnect
> > >
> > > This line is suggesting a problem with your zookeeper. If zookeeper
> > screws
> > > up, HBase will and hence your MR job over HBase will.
> > >
> > > 2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> > > connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and
> > continue.
> > > java.net.ConnectException: Connection refused
> > >
> > > And this suggests datanode crashed. So many processes (don't know if
> they
> > > belong to the same server or not) crashing at the same time seems to
> be a
> > > load issue or a network issue to me.
> > >
> > >
> > >
> > > Regards,
> > > Dhaval
> > >
> > >
> > > ________________________________
> > >  From: David Koch <og...@googlemail.com>
> > > To: user@hbase.apache.org
> > > Sent: Friday, 22 November 2013 12:35 PM
> > > Subject: Re: One Region Server fails - all M/R jobs crash.
> > >
> > >
> > > Here you go:
> > >
> > > Task log: http://pastebin.com/VePTLHEk
> > > Region Server log: http://pastebin.com/iu8y0VYL
> > >
> > >
> > >
> > > On Fri, Nov 22, 2013 at 6:27 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Attachment didn't go through.
> > > >
> > > > Can you pastebin their contents ?
> > > >
> > > > Thanks
> > > >
> > > > On Nov 23, 2013, at 12:55 AM, David Koch <og...@googlemail.com>
> > wrote:
> > > >
> > > > > Sorry for the previous message, I attach the equired log files.
> > > > >
> > > > > Regards,
> > > > >
> > > > > David
> > > > >
> > > > >
> > > > > On Fri, Nov 22, 2013 at 5:53 PM, David Koch <ogdude@googlemail.com
> >
> > > > wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Nov 22, 2013 at 4:17 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > >>> Can you pastebin snippet of:
> > > > >>> 1. task logs which show failure
> > > > >>> 2. region server log shortly before the crash
> > > > >>>
> > > > >>> Thanks
> > > > >>>
> > > > >>>
> > > > >>> On Fri, Nov 22, 2013 at 7:14 AM, David Koch <
> ogdude@googlemail.com
> > >
> > > > wrote:
> > > > >>>
> > > > >>> > Hello,
> > > > >>> >
> > > > >>> > We experience reliability problems when running M/R jobs over
> > HBase
> > > > tables.
> > > > >>> > Specifically, it suffices for one Region Server to crash in
> order
> > > to
> > > > fail
> > > > >>> > all M/R jobs.
> > > > >>> >
> > > > >>> > My guess is that this is not normal with a replication factor
> of
> > 3.
> > > > >>> >
> > > > >>> > The HBase version is 0.94.6 installed as part of of Cloudera
> 4.4.
> > > > HBase
> > > > >>> > settings are pre-sets. Cluster size is 30 machines.
> > > > >>> >
> > > > >>> > What steps can I follow to improve the situation?
> > > > >>> >
> > > > >>> > Thank you,
> > > > >>> >
> > > > >>> > /David
> > > > >>> >
> > > > >
> > > >
> > >
> >
>

Re: One Region Server fails - all M/R jobs crash.

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Hmm, ok, you are right. The principle of Hadoop/HBase is to do big data on commodity hardware, but that's not to say you can do it without enough hardware. Get 8 commodity disks and watch your performance/throughput numbers improve substantially. Before jumping into buying anything, though, I would suggest you look at hardware utilization when the problem happens. That would tell you what your most pressing need is.
 
Regards, 
Dhaval


________________________________
 From: David Koch <og...@googlemail.com>
To: user@hbase.apache.org; Dhaval Shah <pr...@yahoo.co.in> 
Sent: Monday, 25 November 2013 3:36 AM
Subject: Re: One Region Server fails - all M/R jobs crash.
 

Hi Dhaval,

Yes, rows can get very big, that's why we filter them. The filter lets KVs
pass as long as the KV count is < MAX_LIMIT and skips the row entirely once
the count exceeds this limit. KV size is about constant. Alternatively, we
could use batching, you are right.

Also, with regard to the Java version used. Cloudera 4 installs its own JVM
which happens to be Java 7 so it's not a choice we made.

I always thought the principle of Hadoop/HBase was to do big data on
commodity hardware. You suggest we get 1 disk per CPU? I am by no means an
expert in setting up this kind of system.

Thanks again for your response,

/David



On Fri, Nov 22, 2013 at 9:06 PM, Dhaval Shah <pr...@yahoo.co.in>wrote:

> How big can your rows get? If you have a million columns on a row, you
> might run your region server out of memory. Can you try setBatch to a
> smaller number and test if that works?
>
> 10k regions is too many Can you try and increase your max file size and
> see if that helps.
>
> 8 cores / 1 disk is a bad combination. Can you look at disk IO during the
> time of crash and see if you find anything there.
>
> You might also be swapping. Can you look at your GC logs?
>
> You are running dangerously close to the fence with the kind of hardware
> you have.
>
> Regards,
> Dhaval
>
>
> ________________________________
>  From: David Koch <og...@googlemail.com>
> To: user@hbase.apache.org
> Sent: Friday, 22 November 2013 2:43 PM
> Subject: Re: One Region Server fails - all M/R jobs crash.
>
>
> Hello,
>
> Thank you for your replies.
>
> Not that it matters but, cache is 1, batch is -1 on the scan i.e each RPC
> call returns one row. The jobs don't write any data back to HBase,
> compaction is de-activated and done manually. At the time of the crash all
> datanodes were fine, hbchk showed no inconsistencies. Table size is about
> 10k regions/3 billion records on the largest tables and we do a lot of
> server side filtering to limit what's sent across the network.
>
> Our machines may not be the most powerful, 32GB RAM, 8 cores, 1 disk. It's
> also true that when we took a closer look in the past it turned out that
> most of the issues we had were somehow rooted in the fact that CPUs were
> overloaded, not enough memory available - hardware stuff.
>
> What I don't get is why HBase always crashes. I mean if it's slow ok - the
> hardware is a bottleneck but at least you'd expect it to pull through
> eventually. Some days all jobs work fine, some days they don't and there is
> no telling why. HBase's erratic behavior has been causing us a lot of
> headache and we have been spending way too much time fiddling with HBase
> configuration settings over the past 18 months.
>
> /David
>
>
>
> On Fri, Nov 22, 2013 at 7:05 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Thanks Dhaval for the analysis.
> >
> > bq. The HBase version is 0.94.6
> >
> > David:
> > Please upgrade to 0.94.13, if possible. There have been several JIRAs
> > backporting patches from trunk where jdk 1.7 is supported.
> >
> > Please also check your DataNode log to see whether there was problem
> there
> > (likely there was).
> >
> > Cheers
> >
> >
> > On Sat, Nov 23, 2013 at 2:00 AM, Dhaval Shah <
> prince_mithibai@yahoo.co.in
> > >wrote:
> >
> > > You logs suggest that you are overloading resources
> > > (servers/network/memory). How much data are you scanning with your MR
> > job,
> > > how much are you writing back to HBase? What values are you setting for
> > > setBatch, setCaching, setCacheBlocks? How much memory do you have on
> your
> > > region servers? 1 server crashing should not cause a job to fail
> because
> > it
> > > will move on to the next one (given the right parmas for retries and
> > retry
> > > interval are set). Your region server logs suggest that its way more
> > > complicated than that.
> > >
> > > 2013-11-17 09:58:37,513 WARN
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > > region we are already opening or closing;
> > e54b8e16ffbe2187b9017fef596c62aa
> > >
> > > looks like some state inconsistency issue
> > >
> > > I also see that you are using Java 7. Though some people have had
> success
> > > using it, I am not sure if Java 7 is currently the recommended version
> > > (most people use Java 6!)
> > >
> > > 2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to
> > > read additional data from server sessionid 0x342654dfdd30017, likely
> > server
> > > has closed socket, closing socket connection and attempting reconnect
> > >
> > > This line is suggesting a problem with your zookeeper. If zookeeper
> > screws
> > > up, HBase will and hence your MR job over HBase will.
> > >
> > > 2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> > > connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and
> > continue.
> > > java.net.ConnectException: Connection refused
> > >
> > > And this suggests datanode crashed. So many processes (don't know if
> they
> > > belong to the same server or not) crashing at the same time seems to
> be a
> > > load issue or a network issue to me.
> > >
> > >
> > >
> > > Regards,
> > > Dhaval
> > >
> > >
> > > ________________________________
> > >  From: David Koch <og...@googlemail.com>
> > > To: user@hbase.apache.org
> > > Sent: Friday, 22 November 2013 12:35 PM
> > > Subject: Re: One Region Server fails - all M/R jobs crash.
> > >
> > >
> > > Here you go:
> > >
> > > Task log: http://pastebin.com/VePTLHEk
> > > Region Server log: http://pastebin.com/iu8y0VYL
> > >
> > >
> > >
> > > On Fri, Nov 22, 2013 at 6:27 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Attachment didn't go through.
> > > >
> > > > Can you pastebin their contents ?
> > > >
> > > > Thanks
> > > >
> > > > On Nov 23, 2013, at 12:55 AM, David Koch <og...@googlemail.com>
> > wrote:
> > > >
> > > > > Sorry for the previous message, I attach the equired log files.
> > > > >
> > > > > Regards,
> > > > >
> > > > > David
> > > > >
> > > > >
> > > > > On Fri, Nov 22, 2013 at 5:53 PM, David Koch <ogdude@googlemail.com
> >
> > > > wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Nov 22, 2013 at 4:17 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > >>> Can you pastebin snippet of:
> > > > >>> 1. task logs which show failure
> > > > >>> 2. region server log shortly before the crash
> > > > >>>
> > > > >>> Thanks
> > > > >>>
> > > > >>>
> > > > >>> On Fri, Nov 22, 2013 at 7:14 AM, David Koch <
> ogdude@googlemail.com
> > >
> > > > wrote:
> > > > >>>
> > > > >>> > Hello,
> > > > >>> >
> > > > >>> > We experience reliability problems when running M/R jobs over
> > HBase
> > > > tables.
> > > > >>> > Specifically, it suffices for one Region Server to crash in
> order
> > > to
> > > > fail
> > > > >>> > all M/R jobs.
> > > > >>> >
> > > > >>> > My guess is that this is not normal with a replication factor
> of
> > 3.
> > > > >>> >
> > > > >>> > The HBase version is 0.94.6 installed as part of of Cloudera
> 4.4.
> > > > HBase
> > > > >>> > settings are pre-sets. Cluster size is 30 machines.
> > > > >>> >
> > > > >>> > What steps can I follow to improve the situation?
> > > > >>> >
> > > > >>> > Thank you,
> > > > >>> >
> > > > >>> > /David
> > > > >>> >
> > > > >
> > > >
> > >
> >
>

Re: One Region Server fails - all M/R jobs crash.

Posted by David Koch <og...@googlemail.com>.
Hi Dhaval,

Yes, rows can get very big; that's why we filter them. The filter lets KVs
pass as long as the row's KV count is < MAX_LIMIT and skips the row entirely
once the count exceeds this limit. KV size is roughly constant.
Alternatively, we could use batching, you are right.
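Stripped of the HBase API, the counting behaviour described above looks roughly like the sketch below. This is an illustration, not David's actual filter: MAX_LIMIT and the String stand-in for a KeyValue are made up for the example. In the real 0.94 client this logic would live in a FilterBase subclass, with the counter incremented in filterKeyValue() and cleared per row in reset().

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a per-row KV-count cap. A real HBase 0.94 filter would return
// ReturnCode.INCLUDE / NEXT_ROW from filterKeyValue(); here the row's KVs
// are plain Strings so the behaviour is easy to see in isolation.
public class KvCountLimitSketch {
    static final int MAX_LIMIT = 3; // hypothetical cap, for illustration only

    // Returns the row's KVs unchanged if the row stays at or under the cap,
    // or an empty list once the count exceeds it (row skipped entirely).
    static List<String> filterRow(List<String> kvs) {
        int count = 0;                    // reset() equivalent: counter is per row
        for (String kv : kvs) {
            count++;
            if (count > MAX_LIMIT) {      // NEXT_ROW equivalent: drop the whole row
                return new ArrayList<String>();
            }
        }
        return kvs;                       // INCLUDE equivalent
    }
}
```

Note that because the whole row is dropped once the cap is exceeded, memory use on the region server is bounded by MAX_LIMIT KVs per row, which is presumably the point of the filter.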

Also, with regard to the Java version: Cloudera 4 installs its own JVM,
which happens to be Java 7, so it's not a choice we made.

I always thought the principle of Hadoop/HBase was to do big data on
commodity hardware. You suggest we get 1 disk per CPU? I am by no means an
expert in setting up this kind of system.

Thanks again for your response,

/David


On Fri, Nov 22, 2013 at 9:06 PM, Dhaval Shah <pr...@yahoo.co.in>wrote:

> How big can your rows get? If you have a million columns on a row, you
> might run your region server out of memory. Can you try setBatch to a
> smaller number and test if that works?
>
> 10k regions is too many Can you try and increase your max file size and
> see if that helps.
>
> 8 cores / 1 disk is a bad combination. Can you look at disk IO during the
> time of crash and see if you find anything there.
>
> You might also be swapping. Can you look at your GC logs?
>
> You are running dangerously close to the fence with the kind of hardware
> you have.
>
> Regards,
> Dhaval
>
>
> ________________________________
>  From: David Koch <og...@googlemail.com>
> To: user@hbase.apache.org
> Sent: Friday, 22 November 2013 2:43 PM
> Subject: Re: One Region Server fails - all M/R jobs crash.
>
>
> Hello,
>
> Thank you for your replies.
>
> Not that it matters but, cache is 1, batch is -1 on the scan i.e each RPC
> call returns one row. The jobs don't write any data back to HBase,
> compaction is de-activated and done manually. At the time of the crash all
> datanodes were fine, hbchk showed no inconsistencies. Table size is about
> 10k regions/3 billion records on the largest tables and we do a lot of
> server side filtering to limit what's sent across the network.
>
> Our machines may not be the most powerful, 32GB RAM, 8 cores, 1 disk. It's
> also true that when we took a closer look in the past it turned out that
> most of the issues we had were somehow rooted in the fact that CPUs were
> overloaded, not enough memory available - hardware stuff.
>
> What I don't get is why HBase always crashes. I mean if it's slow ok - the
> hardware is a bottleneck but at least you'd expect it to pull through
> eventually. Some days all jobs work fine, some days they don't and there is
> no telling why. HBase's erratic behavior has been causing us a lot of
> headache and we have been spending way too much time fiddling with HBase
> configuration settings over the past 18 months.
>
> /David
>
>
>
> On Fri, Nov 22, 2013 at 7:05 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Thanks Dhaval for the analysis.
> >
> > bq. The HBase version is 0.94.6
> >
> > David:
> > Please upgrade to 0.94.13, if possible. There have been several JIRAs
> > backporting patches from trunk where jdk 1.7 is supported.
> >
> > Please also check your DataNode log to see whether there was problem
> there
> > (likely there was).
> >
> > Cheers
> >
> >
> > On Sat, Nov 23, 2013 at 2:00 AM, Dhaval Shah <
> prince_mithibai@yahoo.co.in
> > >wrote:
> >
> > > You logs suggest that you are overloading resources
> > > (servers/network/memory). How much data are you scanning with your MR
> > job,
> > > how much are you writing back to HBase? What values are you setting for
> > > setBatch, setCaching, setCacheBlocks? How much memory do you have on
> your
> > > region servers? 1 server crashing should not cause a job to fail
> because
> > it
> > > will move on to the next one (given the right parmas for retries and
> > retry
> > > interval are set). Your region server logs suggest that its way more
> > > complicated than that.
> > >
> > > 2013-11-17 09:58:37,513 WARN
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > > region we are already opening or closing;
> > e54b8e16ffbe2187b9017fef596c62aa
> > >
> > > looks like some state inconsistency issue
> > >
> > > I also see that you are using Java 7. Though some people have had
> success
> > > using it, I am not sure if Java 7 is currently the recommended version
> > > (most people use Java 6!)
> > >
> > > 2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to
> > > read additional data from server sessionid 0x342654dfdd30017, likely
> > server
> > > has closed socket, closing socket connection and attempting reconnect
> > >
> > > This line is suggesting a problem with your zookeeper. If zookeeper
> > screws
> > > up, HBase will and hence your MR job over HBase will.
> > >
> > > 2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> > > connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and
> > continue.
> > > java.net.ConnectException: Connection refused
> > >
> > > And this suggests datanode crashed. So many processes (don't know if
> they
> > > belong to the same server or not) crashing at the same time seems to
> be a
> > > load issue or a network issue to me.
> > >
> > >
> > >
> > > Regards,
> > > Dhaval
> > >
> > >
> > > ________________________________
> > >  From: David Koch <og...@googlemail.com>
> > > To: user@hbase.apache.org
> > > Sent: Friday, 22 November 2013 12:35 PM
> > > Subject: Re: One Region Server fails - all M/R jobs crash.
> > >
> > >
> > > Here you go:
> > >
> > > Task log: http://pastebin.com/VePTLHEk
> > > Region Server log: http://pastebin.com/iu8y0VYL
> > >
> > >
> > >
> > > On Fri, Nov 22, 2013 at 6:27 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Attachment didn't go through.
> > > >
> > > > Can you pastebin their contents ?
> > > >
> > > > Thanks
> > > >
> > > > On Nov 23, 2013, at 12:55 AM, David Koch <og...@googlemail.com>
> > wrote:
> > > >
> > > > > Sorry for the previous message, I attach the equired log files.
> > > > >
> > > > > Regards,
> > > > >
> > > > > David
> > > > >
> > > > >
> > > > > On Fri, Nov 22, 2013 at 5:53 PM, David Koch <ogdude@googlemail.com
> >
> > > > wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Nov 22, 2013 at 4:17 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > >>> Can you pastebin snippet of:
> > > > >>> 1. task logs which show failure
> > > > >>> 2. region server log shortly before the crash
> > > > >>>
> > > > >>> Thanks
> > > > >>>
> > > > >>>
> > > > >>> On Fri, Nov 22, 2013 at 7:14 AM, David Koch <
> ogdude@googlemail.com
> > >
> > > > wrote:
> > > > >>>
> > > > >>> > Hello,
> > > > >>> >
> > > > >>> > We experience reliability problems when running M/R jobs over
> > HBase
> > > > tables.
> > > > >>> > Specifically, it suffices for one Region Server to crash in
> order
> > > to
> > > > fail
> > > > >>> > all M/R jobs.
> > > > >>> >
> > > > >>> > My guess is that this is not normal with a replication factor
> of
> > 3.
> > > > >>> >
> > > > >>> > The HBase version is 0.94.6 installed as part of of Cloudera
> 4.4.
> > > > HBase
> > > > >>> > settings are pre-sets. Cluster size is 30 machines.
> > > > >>> >
> > > > >>> > What steps can I follow to improve the situation?
> > > > >>> >
> > > > >>> > Thank you,
> > > > >>> >
> > > > >>> > /David
> > > > >>> >
> > > > >
> > > >
> > >
> >
>

Re: One Region Server fails - all M/R jobs crash.

Posted by Dhaval Shah <pr...@yahoo.co.in>.
How big can your rows get? If you have a million columns in a row, you might run your region server out of memory. Can you try setting setBatch to a smaller number and test whether that works?

10k regions is too many. Can you try increasing your max file size and see if that helps?

8 cores / 1 disk is a bad combination. Can you look at disk IO around the time of the crash and see if you find anything there?

You might also be swapping. Can you look at your GC logs?

You are running dangerously close to the fence with the kind of hardware you have.
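For what it's worth, against the 0.94 client API the scan-side tuning suggested above looks roughly like this. The concrete numbers are illustrative starting points, not values recommended in this thread:

```java
import org.apache.hadoop.hbase.client.Scan;

public class ScanTuning {
    // Illustrative scan setup for a full-table M/R read on HBase 0.94.
    static Scan buildScan() {
        Scan scan = new Scan();
        scan.setBatch(1000);        // cap KVs per Result so one very wide row
                                    // can't exhaust region server memory
        scan.setCaching(100);       // rows fetched per RPC round trip
        scan.setCacheBlocks(false); // don't churn the block cache on a
                                    // one-off full-table scan
        return scan;
    }
}
```

The region-count side of the advice is a server setting rather than client code: raising hbase.hregion.max.filesize lets each region hold more data, so the same table needs fewer regions.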
 
Regards, 
Dhaval


________________________________
 From: David Koch <og...@googlemail.com>
To: user@hbase.apache.org 
Sent: Friday, 22 November 2013 2:43 PM
Subject: Re: One Region Server fails - all M/R jobs crash.
 

Hello,

Thank you for your replies.

Not that it matters but, cache is 1, batch is -1 on the scan i.e each RPC
call returns one row. The jobs don't write any data back to HBase,
compaction is de-activated and done manually. At the time of the crash all
datanodes were fine, hbchk showed no inconsistencies. Table size is about
10k regions/3 billion records on the largest tables and we do a lot of
server side filtering to limit what's sent across the network.

Our machines may not be the most powerful, 32GB RAM, 8 cores, 1 disk. It's
also true that when we took a closer look in the past it turned out that
most of the issues we had were somehow rooted in the fact that CPUs were
overloaded, not enough memory available - hardware stuff.

What I don't get is why HBase always crashes. I mean if it's slow ok - the
hardware is a bottleneck but at least you'd expect it to pull through
eventually. Some days all jobs work fine, some days they don't and there is
no telling why. HBase's erratic behavior has been causing us a lot of
headache and we have been spending way too much time fiddling with HBase
configuration settings over the past 18 months.

/David



On Fri, Nov 22, 2013 at 7:05 PM, Ted Yu <yu...@gmail.com> wrote:

> Thanks Dhaval for the analysis.
>
> bq. The HBase version is 0.94.6
>
> David:
> Please upgrade to 0.94.13, if possible. There have been several JIRAs
> backporting patches from trunk where jdk 1.7 is supported.
>
> Please also check your DataNode log to see whether there was problem there
> (likely there was).
>
> Cheers
>
>
> On Sat, Nov 23, 2013 at 2:00 AM, Dhaval Shah <prince_mithibai@yahoo.co.in
> >wrote:
>
> > You logs suggest that you are overloading resources
> > (servers/network/memory). How much data are you scanning with your MR
> job,
> > how much are you writing back to HBase? What values are you setting for
> > setBatch, setCaching, setCacheBlocks? How much memory do you have on your
> > region servers? 1 server crashing should not cause a job to fail because
> it
> > will move on to the next one (given the right parmas for retries and
> retry
> > interval are set). Your region server logs suggest that its way more
> > complicated than that.
> >
> > 2013-11-17 09:58:37,513 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > region we are already opening or closing;
> e54b8e16ffbe2187b9017fef596c62aa
> >
> > looks like some state inconsistency issue
> >
> > I also see that you are using Java 7. Though some people have had success
> > using it, I am not sure if Java 7 is currently the recommended version
> > (most people use Java 6!)
> >
> > 2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to
> > read additional data from server sessionid 0x342654dfdd30017, likely
> server
> > has closed socket, closing socket connection and attempting reconnect
> >
> > This line is suggesting a problem with your zookeeper. If zookeeper
> screws
> > up, HBase will and hence your MR job over HBase will.
> >
> > 2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and
> continue.
> > java.net.ConnectException: Connection refused
> >
> > And this suggests datanode crashed. So many processes (don't know if they
> > belong to the same server or not) crashing at the same time seems to be a
> > load issue or a network issue to me.
> >
> >
> >
> > Regards,
> > Dhaval
> >
> >
> > ________________________________
> >  From: David Koch <og...@googlemail.com>
> > To: user@hbase.apache.org
> > Sent: Friday, 22 November 2013 12:35 PM
> > Subject: Re: One Region Server fails - all M/R jobs crash.
> >
> >
> > Here you go:
> >
> > Task log: http://pastebin.com/VePTLHEk
> > Region Server log: http://pastebin.com/iu8y0VYL
> >
> >
> >
> > On Fri, Nov 22, 2013 at 6:27 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Attachment didn't go through.
> > >
> > > Can you pastebin their contents ?
> > >
> > > Thanks
> > >
> > > On Nov 23, 2013, at 12:55 AM, David Koch <og...@googlemail.com>
> wrote:
> > >
> > > > Sorry for the previous message, I attach the equired log files.
> > > >
> > > > Regards,
> > > >
> > > > David
> > > >
> > > >
> > > > On Fri, Nov 22, 2013 at 5:53 PM, David Koch <og...@googlemail.com>
> > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Nov 22, 2013 at 4:17 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > >>> Can you pastebin snippet of:
> > > >>> 1. task logs which show failure
> > > >>> 2. region server log shortly before the crash
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>>
> > > >>> On Fri, Nov 22, 2013 at 7:14 AM, David Koch <ogdude@googlemail.com
> >
> > > wrote:
> > > >>>
> > > >>> > Hello,
> > > >>> >
> > > >>> > We experience reliability problems when running M/R jobs over
> HBase
> > > tables.
> > > >>> > Specifically, it suffices for one Region Server to crash in order
> > to
> > > fail
> > > >>> > all M/R jobs.
> > > >>> >
> > > >>> > My guess is that this is not normal with a replication factor of
> 3.
> > > >>> >
> > > >>> > The HBase version is 0.94.6 installed as part of of Cloudera 4.4.
> > > HBase
> > > >>> > settings are pre-sets. Cluster size is 30 machines.
> > > >>> >
> > > >>> > What steps can I follow to improve the situation?
> > > >>> >
> > > >>> > Thank you,
> > > >>> >
> > > >>> > /David
> > > >>> >
> > > >
> > >
> >
>

Re: One Region Server fails - all M/R jobs crash.

Posted by David Koch <og...@googlemail.com>.
Hello,

Thank you for your replies.

Not that it matters, but cache is 1 and batch is -1 on the scan, i.e. each
RPC call returns one whole row. The jobs don't write any data back to HBase;
compaction is de-activated and done manually. At the time of the crash all
datanodes were fine, and hbck showed no inconsistencies. Table size is about
10k regions / 3 billion records on the largest tables, and we do a lot of
server-side filtering to limit what's sent across the network.

Our machines may not be the most powerful: 32GB RAM, 8 cores, 1 disk. It's
also true that when we took a closer look in the past, most of the issues we
had turned out to be rooted in overloaded CPUs or not enough available
memory - hardware stuff.

What I don't get is why HBase always crashes. If it's slow, OK - the
hardware is a bottleneck - but you'd at least expect it to pull through
eventually. Some days all jobs work fine, some days they don't, and there is
no telling why. HBase's erratic behavior has been causing us a lot of
headaches, and we have spent way too much time fiddling with HBase
configuration settings over the past 18 months.

/David



Re: One Region Server fails - all M/R jobs crash.

Posted by Ted Yu <yu...@gmail.com>.
Thanks Dhaval for the analysis.

bq. The HBase version is 0.94.6

David:
Please upgrade to 0.94.13, if possible. There have been several JIRAs
backporting patches from trunk that add JDK 1.7 support.

Please also check your DataNode log to see whether there was a problem there
(likely there was).
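One quick way to follow up on the DataNode suggestion is sketched below. The commands assume a CDH4-style install, and the log path is the Cloudera default, so treat both as assumptions to adapt to your cluster:

```shell
# Did HDFS lose DataNodes? (run as the hdfs user)
hadoop dfsadmin -report | grep -A 1 'Datanodes available'

# Scan the DataNode log on a suspect host for common failure signatures.
# The path is the CDH default and may differ on your install.
grep -iE 'ERROR|Exception|xceiver' /var/log/hadoop-hdfs/*datanode*.log | tail -n 20
```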

Cheers



Re: One Region Server fails - all M/R jobs crash.

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Your logs suggest that you are overloading resources (servers/network/memory). How much data are you scanning with your MR job, and how much are you writing back to HBase? What values are you setting for setBatch, setCaching, setCacheBlocks? How much memory do you have on your region servers? One server crashing should not cause a job to fail, because the job will move on to the next one (given the right params for retries and retry interval are set). Your region server logs suggest that it's way more complicated than that. 
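For reference, scan tuning in a 0.94-era TableMapper job typically looks like the sketch below. The table name, column family, mapper class, and concrete values are illustrative placeholders, not recommendations for this cluster; they need to be sized against row width and region server heap.

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

// Sketch: configure the Scan that drives the MR job.
Scan scan = new Scan();
scan.setCaching(100);       // rows fetched per RPC; with large rows keep this small
scan.setBatch(100);         // cap the columns returned per Result for very wide rows
scan.setCacheBlocks(false); // don't churn the block cache during a full-table scan

// "mytable" and MyMapper are placeholders for your own table and mapper.
TableMapReduceUtil.initTableMapperJob(
    "mytable", scan,
    MyMapper.class, ImmutableBytesWritable.class, Result.class,
    job);
```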

2013-11-17 09:58:37,513 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for region we are already opening or closing; e54b8e16ffbe2187b9017fef596c62aa

looks like some state inconsistency issue

I also see that you are using Java 7. Though some people have had success using it, I am not sure if Java 7 is currently the recommended version (most people use Java 6!)

2013-11-18 18:01:47,959 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x342654dfdd30017, likely server has closed socket, closing socket connection and attempting reconnect

This line suggests a problem with your ZooKeeper. If ZooKeeper screws up, HBase will too, and hence so will your MR job over HBase. 
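The client retry knobs and the ZooKeeper session timeout mentioned above live in hbase-site.xml. The values below are illustrative starting points only, not tuned recommendations for this cluster:

```xml
<!-- Illustrative hbase-site.xml fragment; tune the values for your cluster. -->
<property>
  <name>hbase.client.retries.number</name>
  <value>10</value> <!-- how many times the client retries a failed region operation -->
</property>
<property>
  <name>hbase.client.pause</name>
  <value>1000</value> <!-- base pause in ms between retries (backs off progressively) -->
</property>
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value> <!-- ms before ZooKeeper expires a region server's session -->
</property>
```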

2013-11-21 06:54:01,105 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /XXX.XXX.XXX.XXX:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection refused

And this suggests the datanode crashed. So many processes (I don't know whether they belong to the same server or not) crashing at the same time looks like a load issue or a network issue to me. 


 
Regards, 
Dhaval



Re: One Region Server fails - all M/R jobs crash.

Posted by David Koch <og...@googlemail.com>.
Here you go:

Task log: http://pastebin.com/VePTLHEk
Region Server log: http://pastebin.com/iu8y0VYL



Re: One Region Server fails - all M/R jobs crash.

Posted by Ted Yu <yu...@gmail.com>.
Attachment didn't go through. 

Can you pastebin their contents ?

Thanks

> 

Re: One Region Server fails - all M/R jobs crash.

Posted by David Koch <og...@googlemail.com>.
Sorry for the previous message; I attach the required log files.

Regards,

David



Re: One Region Server fails - all M/R jobs crash.

Posted by David Koch <og...@googlemail.com>.
On Fri, Nov 22, 2013 at 4:17 PM, Ted Yu <yu...@gmail.com> wrote:

> Can you pastebin snippet of:
> 1. task logs which show failure
> 2. region server log shortly before the crash
>
> Thanks
>
>
> On Fri, Nov 22, 2013 at 7:14 AM, David Koch <og...@googlemail.com> wrote:
>
> > Hello,
> >
> > We experience reliability problems when running M/R jobs over HBase
> tables.
> > Specifically, it suffices for one Region Server to crash in order to fail
> > all M/R jobs.
> >
> > My guess is that this is not normal with a replication factor of 3.
> >
> > The HBase version is 0.94.6 installed as part of of Cloudera 4.4. HBase
> > settings are pre-sets. Cluster size is 30 machines.
> >
> > What steps can I follow to improve the situation?
> >
> > Thank you,
> >
> > /David
> >
>

Re: One Region Server fails - all M/R jobs crash.

Posted by Ted Yu <yu...@gmail.com>.
Can you pastebin snippet of:
1. task logs which show failure
2. region server log shortly before the crash

Thanks

