Posted to user@hbase.apache.org by Flavio Pompermaier <po...@okkam.it> on 2014/04/04 12:08:53 UTC

HBase cluster design

Hi to everybody,

I have a probably stupid question: is it a problem to run many mapreduce
jobs on the same HBase table at the same time? And multiple jobs on
different tables on the same cluster?
Should I use Hoya to have a better cluster usage..?

In my current cluster I noticed that the region servers tend to go down if
I run a mapreduce job while updating (maybe it could be related to the old
version of HBase I'm currently running: 0.92.1-cdh4.1.2).

Best,
Flavio

Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Thanks again for the tips! And what about cache blocks? Why should I avoid
caching them?


Re: HBase cluster design

Posted by Vikram Singh Chandel <vi...@gmail.com>.
Hi Flavio

I suppose you are using Cloudera Manager for HBase management. To change the
RAM usage, Cloudera has its own steps:

1: Go to the HBase service
2: Click on Configuration
3: Click on View and Edit
4: Search for the Environment safety valve
5: Make the following entry there: "HBASE_HEAPSIZE=3072" (without quotes)
6: Save -> deploy client configurations (under Action tab) -> Restart HBase
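
For reference, on a plain Apache install without Cloudera Manager the same
setting would go into conf/hbase-env.sh instead; a minimal sketch (the value
is in MB, and 3072 is just the example figure from the steps above):

# conf/hbase-env.sh
# Heap size for the HBase daemons, in MB (here: 3 GB)
export HBASE_HEAPSIZE=3072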


-- 
*Regards*

*VIKRAM SINGH CHANDEL*


Re: HBase cluster design

Posted by Ted Yu <yu...@gmail.com>.
For #2, see http://hbase.apache.org/book/perf.configurations.html

The relevant config parameters start at section 12.4.3

Cheers


Re: HBase cluster design

Posted by "prince_mithibai@yahoo.co.in" <pr...@yahoo.co.in>.
Setting cache blocks to false while running an MR job serves 2 purposes - it helps real-time requests by not wiping out the caches for MR jobs, which don't really need/use them, and it helps prevent churn/fragmentation in the block cache, which helps GC

There are properties you need to set in hbase-site.xml for that
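
For the CDH4-era releases discussed in this thread, the properties in question
should be hfile.block.cache.size and the
hbase.regionserver.global.memstore.upperLimit/lowerLimit pair, all fractions
of the region server heap; a minimal sketch using the stock default values
(the numbers are illustrative, not recommendations):

<!-- hbase-site.xml -->
<property>
  <!-- fraction of the RS heap given to the block cache -->
  <name>hfile.block.cache.size</name>
  <value>0.25</value>
</property>
<property>
  <!-- flush memstores once their total reaches this fraction of the heap -->
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<property>
  <!-- keep flushing until total memstore usage drops below this fraction -->
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.35</value>
</property>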

Sent from Yahoo Mail on Android


Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Thank you all for the great suggestions, I'll try to test them ASAP. Just 2
questions:
- why should I set setCacheBlocks to false?
- how can I increase/decrease the amount of RAM provided to the block
caches and memstores?

Best,
Flavio



RE: HBase cluster design

Posted by Henry Hung <YT...@winbond.com>.
@Flavio:
One thing you should consider: is the CPU oversaturated?

For instance, if one server has 24 cores and the task tracker is configured to also execute 24 MR tasks simultaneously, then there is no core left for HBase to do GC, and it crashes.

A few weeks ago my region servers would sometimes crash while MR was running, so I decided to move the MR cluster to another machine, leaving only DataNode + RegionServer running.
After the change, my region servers are still running without a crash.

My suggestion is that you try decreasing the MR task count to around 12 or fewer, and see if the crash frequency decreases.
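
For the MR1/CDH4 setups discussed in this thread, the per-node task count is
capped by the TaskTracker slot settings; a minimal sketch of lowering them,
with 12 as the illustrative figure from above:

<!-- mapred-site.xml on each TaskTracker node -->
<property>
  <!-- max concurrent map tasks per node -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <!-- max concurrent reduce tasks per node -->
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>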

Best regards,
Henry Hung


Re: HBase cluster design

Posted by Dhaval Shah <pr...@yahoo.co.in>.
A few things pop out to me on a cursory glance:
- You are using CMSIncrementalMode, which after a long chain of events has a tendency to result in the famous Juliet pause of death. Can you try ParNew GC instead and see if that helps?
- You should try to reduce the CMSInitiatingOccupancyFraction to avoid a full GC
- Your hbase-env.sh is not setting the Xmx at all. Do you know how much RAM you are giving to your region servers? It may be too small or too large given your use case and machine sizes (a sketch of the corresponding hbase-env.sh entries follows this list)
- Your client scanner caching is 1, which may be too small depending on your row sizes. You can also override that setting in your scan for the MR job (see the scan sketch further below)
- You only have 2 zookeeper instances, which is not recommended at all. Zookeeper needs a quorum to operate and generally works best with an odd number of zookeeper servers. This probably isn't related to your crashes, but it would help stability if you had 1 or 3 zookeepers
- I am not 100% sure if the version of HBase you are using has MSLAB enabled. If not, you should enable it.
- You can try increasing/decreasing the amount of RAM you provide to the block caches and memstores to suit your use case. I see that you are using the defaults here
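
A minimal sketch of hbase-env.sh entries covering the first three points; the
heap size and the occupancy threshold are illustrative placeholders, not tuned
values:

# conf/hbase-env.sh (sketch; sizes/thresholds are placeholders to adapt)
# explicit heap, plus ParNew + CMS with an earlier, fixed initiating threshold
export HBASE_OPTS="-Xmx8g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"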

On top of these, when you kick off your MR job to scan HBase, you should set setCacheBlocks to false
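
A minimal sketch of such a scan for a TableMapper job; MyMapper and the table
name are placeholders, the rest is the stock HBase 0.92/0.94-era client API:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "scan-my-table");
Scan scan = new Scan();
scan.setCacheBlocks(false); // don't churn the RS block cache from the MR scan
scan.setCaching(500);       // rows per RPC; overrides the client default of 1
TableMapReduceUtil.initTableMapperJob(
    "my_table",                     // placeholder table name
    scan,
    MyMapper.class,                 // placeholder TableMapper subclass
    ImmutableBytesWritable.class,   // map output key
    Result.class,                   // map output value
    job);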



Regards,
Dhaval
 


Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
The hardware specs are: 4 nodes, each with 48g RAM, 24 cores and 1 TB of disk.
I've attached my HBase config files.

Thanks,
Flavio


Re: HBase cluster design

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Can you share your hbase-env.sh and hbase-site.xml? And hardware specs of your cluster?


 

Regards,
Dhaval



Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Could you please tell me in detail the parameters you'd like to see, so I
can look for them and learn the important ones? I'm using Cloudera, CDH4 in
one cluster and CDH5 in the other.

Best,
Flavio

Re: HBase cluster design

Posted by "prince_mithibai@yahoo.co.in" <pr...@yahoo.co.in>.
Can you describe your setup in more detail? Specifically the amount of heap your HBase region servers have and your GC settings. Is your server swapping when your MR jobs are running? Also, do your regions go down, or your region servers?

We run many MR jobs simultaneously on our HBase tables (size is in TBs) along with serving real-time requests at the same time. So I can vouch for the fact that a well-tuned HBase cluster definitely supports this use case (well-tuned is the key word here)

Sent from Yahoo Mail on Android


Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Thanks for the response. Actually I'm still trying to understand why some
of the regions of my HBase go down from time to time during my mapred job
if table updates occur, because in the logs there's nothing interesting.. The
updates usually happen in bursts of 10/100 sequential puts per second. Is
there any rule of thumb for those scenarios to avoid problems, or some
fundamental tuning to check? I have a lot of RAM (48g) and 24 processors per
server (for a total of 4 servers) and I don't have that much data (20g), so I
don't understand why the region servers go down (usually after a couple
of mapred jobs).
However, in general, speaking also with other people using HBase, it seems
that it is not very safe to run mapred jobs while updating the table.. are we
wrong?

Best,
Flavio



Re: HBase cluster design

Posted by Stack <st...@duboce.net>.

Run fewer mappers/reducers. Start with only one and move up from there.
Ditto for other processes updating HBase.

You have monitoring going on on this cluster?  What is it telling you about
the loadings?

St.Ack

Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Of course I did.. but that's not very helpful, it's too broad!
I need something more specific to tuning HBase to support running
multiple mapred jobs while updating the tables..

Best,
Flavio



Re: HBase cluster design

Posted by Ted Yu <yu...@gmail.com>.
Have you looked at http://hbase.apache.org/book/performance.html ?

Cheers

On May 13, 2014, at 3:14 AM, Flavio Pompermaier <po...@okkam.it> wrote:

> So just to summarize the result of this discussion..
> can you confirm that the latest version of HBase should (in theory) support
> mapreduce jobs on tables that in the meantime could be updated by external
> processes (i.e. not by the mapred job)?
> One of the answers said: "Poorly tuned HBase clusters can
> fail easily under heavy load"..
> Could you suggest some tuning to avoid HBase crashing in such
> situations?
> 
> Best,
> Flavio
> 
> 
> On Fri, Apr 11, 2014 at 12:06 PM, Flavio Pompermaier
> <po...@okkam.it> wrote:
> 
>> Today I was able to catch an error during a mapreduce job that actually
>> mimics the rowCount, more or less.
>> The error I saw is:
>> 
>> Could not sync. Requesting close of hlog
>> java.io.IOException: Reflection
>>    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
>>    at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1141)
>>    at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1245)
>>    at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1100)
>>    at java.lang.Thread.run(Thread.java:662)
>> Caused by: java.lang.reflect.InvocationTargetException
>>    at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
>>    ... 4 more
>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/.logs/host4,60020,1395928532020/host4%2C60020%2C1395928532020.1397205288300 File does not exist. Holder DFSClient_NONMAPREDUCE_-1746149332_40 does not have any open files.
>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2308)
>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2299)
>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2095)
>>    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
>>    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>>    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>>    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
>>    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
>>    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
>>    at java.security.AccessController.doPrivileged(Native Method)
>>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>> 
>>    at org.apache.hadoop.ipc.Client.call(Client.java:1160)
>>    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>>    at $Proxy14.addBlock(Unknown Source)
>>    at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>>    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>>    at $Proxy14.addBlock(Unknown Source)
>>    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
>>    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
>>    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
>>    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
>> 
>> 
>> What can be the cause of this error?
>> 
>> 
>> On Sat, Apr 5, 2014 at 2:25 PM, Michael Segel <mi...@hotmail.com> wrote:
>> 
>>> You have one other thing to consider.
>>> 
>>> Did you oversubscribe on the m/r tuning side of things?
>>> 
>>> Many people want to segment their HBase to a portion of the cluster.
>>> This should be the exception to the design, not the primary cluster design.
>>> 
>>> If you oversubscribe your cluster, you will run out of memory, then you
>>> need to swap, and boom, bad things happen.
>>> 
>>> Also, while many suggest not reserving room for swap... I suggest that
>>> you do leave some room.
>>> 
>>> While this doesn't address the issues in your question directly, they are
>>> something that you need to consider.
>>> 
>>> More to your point...
>>> Poorly tuned HBase clusters can fail easily under heavy load.
>>> 
>>> While Ted doesn't address this... consideration, it can become an issue.
>>> 
>>> YMMV of course.
>>> 
>>> 
>>> 
>>> On Apr 4, 2014, at 9:43 AM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>>> The 'Connection refused' message was logged at WARN level.
>>>> 
>>>> If you can pastebin more of the region server log before its crash, I
>>>> would take a deeper look.
>>>> 
>>>> BTW I assume your zookeeper quorum was healthy during that period of
>>>> time.
>>>> 
>>>> 
>>>> On Fri, Apr 4, 2014 at 7:29 AM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>> 
>>>>> Yes I know I should update HBase, this is something I'm going to do
>>>>> really soon. Bad me..
>>>>> I just wanted to know if the fact of adding/updating rows in HBase
>>>>> while running a mapred job could be problematic or not..
>>>>> From what you told me it's not, so the problem could be caused by the
>>>>> old version of HBase or some other OS configuration.
>>>>> The update was performed via an application accessing HBase directly,
>>>>> adding and updating rows of the table.
>>>>> Once in a while some region servers go down and are marked as "bad
>>>>> state" by Cloudera, so I have to restart them.
>>>>> 
>>>>> The error I usually see is:
>>>>> 
>>>>> 2012-11-23 12:41:00,468 WARN org.apache.zookeeper.ClientCnxn: Session
>>>>> 0x13b2cf447fd0000 for server null, unexpected error, closing socket
>>>>> connection and attempting reconnect
>>>>> java.net.ConnectException: Connection refused
>>>>>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>       at
>>>>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>>>>>       at
>>>>> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
>>>>>       at
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1047)
>>>>> 
>>>>> Best,
>>>>> Flavio
>>>>> 
>>>>> On Fri, Apr 4, 2014 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>> 
>>>>>> Was the updating performed by one of the mapreduce jobs?
>>>>>> HBase should be able to serve multiple mapreduce jobs in the same
>>>>>> cluster.
>>>>>> 
>>>>>> Can you provide more detail on the crash?
>>>>>> 
>>>>>> BTW, there are 3 major releases after 0.92.
>>>>>> Please consider upgrading your cluster to a newer release.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi to everybody,
>>>>>>> 
>>>>>>> I have a probably stupid question: is it a problem to run many
>>>>>>> mapreduce jobs on the same HBase table at the same time? And multiple
>>>>>>> jobs on different tables on the same cluster?
>>>>>>> Should I use Hoya to have a better cluster usage..?
>>>>>>> 
>>>>>>> In my current cluster I noticed that the region servers tend to go
>>>>>>> down if I run a mapreduce job while updating (maybe it could be
>>>>>>> related to the old version of HBase I'm currently running:
>>>>>>> 0.92.1-cdh4.1.2).
>>>>>>> 
>>>>>>> Best,
>>>>>>> Flavio
>>> 
>>> The opinions expressed here are mine, while they may reflect a cognitive
>>> thought, that is purely accidental.
>>> Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com
>>> 
>>> 
>>> 
>>> 
>>> 

Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
So just to summarize the result of this discussion..
can you confirm that the latest version of HBase should (in theory) support
mapreduce jobs on tables that in the meantime could be updated by external
processes (i.e. not by the mapred job)?
One of the answers said: "Poorly tuned HBase clusters can
fail easily under heavy load"..
Could you suggest some tuning to avoid HBase crashing in such
situations?

Best,
Flavio


On Fri, Apr 11, 2014 at 12:06 PM, Flavio Pompermaier
<po...@okkam.it>wrote:

> Today I was able to catch an error during a mapreduce job that actually
> mimes the rowCount more or less.
> The error I saw is:
>
> ould not sync. Requesting close of hlog
> java.io.IOException: Reflection
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1141)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1245)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1100)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.reflect.InvocationTargetException
> 	at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
> 	... 4 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/.logs/host4,60020,1395928532020/host4%2C60020%2C1395928532020.1397205288300 File does not exist. Holder DFSClient_NONMAPREDUCE_-1746149332_40 does not have any open files.
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2308)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2299)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2095)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1160)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
> 	at $Proxy14.addBlock(Unknown Source)
> 	at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> 	at $Proxy14.addBlock(Unknown Source)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
>
>
> What can be the cause of this error?
>
>
> On Sat, Apr 5, 2014 at 2:25 PM, Michael Segel <mi...@hotmail.com> wrote:
>
>> You have one other thing to consider.
>>
>> Did you oversubscribe on the m/r tuning side of things?
>>
>> Many people want to segment their HBase to a portion of the cluster.
>> This should be the exception to the design, not the primary cluster design.
>>
>> If you oversubscribe your cluster, you will run out of memory, then you
>> need to swap, and boom, bad things happen.
>>
>> Also, while many suggest not reserving room for swap... I suggest that
>> you do leave some room.
>>
>> While these points don't address the issues in your question directly,
>> they are something that you need to consider.
>>
>> More to your point...
>> Poorly tuned HBase clusters can fail easily under heavy load.
>>
>> While Ted doesn't address this... consideration, it can become an issue.
>>
>> YMMV of course.
>>
>>
>>
>> On Apr 4, 2014, at 9:43 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > The 'Connection refused' message was logged at WARN level.
>> >
>> > If you can pastebin more of the region server log before its crash, I
>> > would take a deeper look.
>> >
>> > BTW I assume your zookeeper quorum was healthy during that period of
>> > time.
>> >
>> >
>> > On Fri, Apr 4, 2014 at 7:29 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> >
>> >> Yes I know I should update HBase, this is something I'm going to do
>> >> really soon. Bad me..
>> >> I just wanted to know if the fact of adding/updating rows in HBase
>> >> while running a mapred job could be problematic or not..
>> >> From what you told me it's not, so the problem could be caused by the
>> >> old version of HBase or some other os configuration.
>> >> The update was performed via an application accessing HBase directly,
>> >> adding and updating rows of the table.
>> >> Once in a while some region servers go down and are marked as "bad
>> >> state" by Cloudera, so I have to restart them.
>> >>
>> >> The error I usually see is:
>> >>
>> >> 2012-11-23 12:41:00,468 WARN org.apache.zookeeper.ClientCnxn: Session
>> >> 0x13b2cf447fd0000 for server null, unexpected error, closing socket
>> >> connection and attempting reconnect
>> >> java.net.ConnectException: Connection refused
>> >>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>> >>        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
>> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1047
>> >>
>> >> Best,
>> >> Flavio
>> >>
>> >> On Fri, Apr 4, 2014 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:
>> >>
>> >>> Was the updating performed by one of the mapreduce jobs ?
>> >>> HBase should be able to serve multiple mapreduce jobs in the same
>> >>> cluster.
>> >>>
>> >>> Can you provide more detail on the crash ?
>> >>>
>> >>> BTW, there are 3 major releases after 0.92
>> >>> Please consider upgrading your cluster to a newer release.
>> >>>
>> >>> Cheers
>> >>>
>> >>> On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it>
>> >>> wrote:
>> >>>
>> >>>> Hi to everybody,
>> >>>>
>> >>>> I have a probably stupid question: is it a problem to run many
>> >>>> mapreduce jobs on the same HBase table at the same time? And multiple
>> >>>> jobs on different tables on the same cluster?
>> >>>> Should I use Hoya to have a better cluster usage..?
>> >>>>
>> >>>> In my current cluster I noticed that the region servers tend to go
>> >>>> down if I run a mapreduce job while updating (maybe it could be
>> >>>> related to the old version of HBase I'm currently running:
>> >>>> 0.92.1-cdh4.1.2).
>> >>>>
>> >>>> Best,
>> >>>> Flavio
>> >>>
>> >>
>>
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com
>>
>>
>>
>>
>>

Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Today I was able to catch an error during a mapreduce job that actually
mimics the rowCount more or less.
The error I saw is:

Could not sync. Requesting close of hlog
java.io.IOException: Reflection
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1141)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1245)
	at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1100)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
	... 4 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /hbase/.logs/host4,60020,1395928532020/host4%2C60020%2C1395928532020.1397205288300
File does not exist. Holder DFSClient_NONMAPREDUCE_-1746149332_40 does
not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2308)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2299)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2095)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

	at org.apache.hadoop.ipc.Client.call(Client.java:1160)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
	at $Proxy14.addBlock(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at $Proxy14.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)


What can be the cause of this error?

On Sat, Apr 5, 2014 at 2:25 PM, Michael Segel <mi...@hotmail.com> wrote:

> You have one other thing to consider.
>
> Did you oversubscribe on the m/r tuning side of things?
>
> Many people want to segment their HBase to a portion of the cluster.
> This should be the exception to the design, not the primary cluster design.
>
> If you oversubscribe your cluster, you will run out of memory, then you
> need to swap, and boom, bad things happen.
>
> Also, while many suggest not reserving room for swap... I suggest that you
> do leave some room.
>
> While these points don't address the issues in your question directly,
> they are something that you need to consider.
>
> More to your point...
> Poorly tuned HBase clusters can fail easily under heavy load.
>
> While Ted doesn't address this... consideration, it can become an issue.
>
> YMMV of course.
>
>
>
> On Apr 4, 2014, at 9:43 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > The 'Connection refused' message was logged at WARN level.
> >
> > If you can pastebin more of the region server log before its crash, I
> > would take a deeper look.
> >
> > BTW I assume your zookeeper quorum was healthy during that period of
> > time.
> >
> >
> > On Fri, Apr 4, 2014 at 7:29 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> >
> >> Yes I know I should update HBase, this is something I'm going to do
> >> really soon. Bad me..
> >> I just wanted to know if the fact of adding/updating rows in HBase while
> >> running a mapred job could be problematic or not..
> >> From what you told me it's not, so the problem could be caused by the
> >> old version of HBase or some other os configuration.
> >> The update was performed via an application accessing HBase directly,
> >> adding and updating rows of the table.
> >> Once in a while some region servers go down and are marked as "bad
> >> state" by Cloudera, so I have to restart them.
> >>
> >> The error I usually see is:
> >>
> >> 2012-11-23 12:41:00,468 WARN org.apache.zookeeper.ClientCnxn: Session
> >> 0x13b2cf447fd0000 for server null, unexpected error, closing socket
> >> connection and attempting reconnect
> >> java.net.ConnectException: Connection refused
> >>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> >>        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1047
> >>
> >> Best,
> >> Flavio
> >>
> >> On Fri, Apr 4, 2014 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >>> Was the updating performed by one of the mapreduce jobs ?
> >>> HBase should be able to serve multiple mapreduce jobs in the same
> >>> cluster.
> >>>
> >>> Can you provide more detail on the crash ?
> >>>
> >>> BTW, there are 3 major releases after 0.92
> >>> Please consider upgrading your cluster to a newer release.
> >>>
> >>> Cheers
> >>>
> >>> On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it>
> >>> wrote:
> >>>
> >>>> Hi to everybody,
> >>>>
> >>>> I have a probably stupid question: is it a problem to run many
> >>>> mapreduce jobs on the same HBase table at the same time? And multiple
> >>>> jobs on different tables on the same cluster?
> >>>> Should I use Hoya to have a better cluster usage..?
> >>>>
> >>>> In my current cluster I noticed that the region servers tend to go
> >>>> down if I run a mapreduce job while updating (maybe it could be
> >>>> related to the old version of HBase I'm currently running:
> >>>> 0.92.1-cdh4.1.2).
> >>>>
> >>>> Best,
> >>>> Flavio
> >>>
> >>
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
>
>
>
>
>
>

Re: HBase cluster design

Posted by Michael Segel <mi...@hotmail.com>.
You have one other thing to consider. 

Did you oversubscribe on the m/r tuning side of things?

Many people want to segment their HBase to a portion of the cluster. 
This should be the exception to the design, not the primary cluster design.

If you oversubscribe your cluster, you will run out of memory, then you need to swap, and boom, bad things happen.
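
To put numbers on it (purely illustrative): on a 48 GB worker running
DataNode + TaskTracker + RegionServer, 12 map slots x 2 GB child heap is
already 24 GB for MapReduce alone; add a 12 GB RegionServer heap, roughly
1 GB for the DataNode and a couple of GB for the OS, and you are pushing
40 GB before the page cache gets anything. Double the slots
(mapred.tasktracker.map.tasks.maximum = 24 in mapred-site.xml) without
shrinking anything else and you are past physical RAM; the first compaction
or GC spike swaps the node, and the RegionServer starts missing its
ZooKeeper heartbeats.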

Also, while many suggest not reserving room for swap... I suggest that you do leave some room. 

While these points don't address the issues in your question directly, they are something that you need to consider.

More to your point... 
Poorly tuned HBase clusters can fail easily under heavy load. 

While Ted doesn't address this... consideration, it can become an issue. 

YMMV of course. 



On Apr 4, 2014, at 9:43 AM, Ted Yu <yu...@gmail.com> wrote:

> The 'Connection refused' message was logged at WARN level.
> 
> If you can pastebin more of the region server log before its crash, I would
> take a deeper look.
> 
> BTW I assume your zookeeper quorum was healthy during that period of time.
> 
> 
> On Fri, Apr 4, 2014 at 7:29 AM, Flavio Pompermaier <po...@okkam.it> wrote:
> 
>> Yes I know I should update HBase, this is something I'm going to do really
>> soon. Bad me..
>> I just wanted to know if the fact of adding/updating rows in HBase while
>> running a mapred job could be problematic or not..
>> From what you told me it's not, so the problem could be caused by the old
>> version of HBase or some other os configuration.
>> The update was performed via an application accessing HBase directly,
>> adding and updating rows of the table.
>> Once in a while some region servers go down and are marked as "bad state" by
>> Cloudera, so I have to restart them.
>> 
>> The error I usually see is:
>> 
>> 2012-11-23 12:41:00,468 WARN org.apache.zookeeper.ClientCnxn: Session
>> 0x13b2cf447fd0000 for server null, unexpected error, closing socket
>> connection and attempting reconnect
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>>        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
>>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1047
>> 
>> Best,
>> Flavio
>> 
>> On Fri, Apr 4, 2014 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:
>> 
>>> Was the updating performed by one of the mapreduce jobs ?
>>> HBase should be able to serve multiple mapreduce jobs in the same
>>> cluster.
>>> 
>>> Can you provide more detail on the crash ?
>>> 
>>> BTW, there are 3 major releases after 0.92
>>> Please consider upgrading your cluster to a newer release.
>>> 
>>> Cheers
>>> 
>>> On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it>
>>> wrote:
>>> 
>>>> Hi to everybody,
>>>> 
>>>> I have a probably stupid question: is it a problem to run many
>>>> mapreduce jobs on the same HBase table at the same time? And multiple
>>>> jobs on different tables on the same cluster?
>>>> Should I use Hoya to have a better cluster usage..?
>>>>
>>>> In my current cluster I noticed that the region servers tend to go down
>>>> if I run a mapreduce job while updating (maybe it could be related to
>>>> the old version of HBase I'm currently running: 0.92.1-cdh4.1.2).
>>>> 
>>>> Best,
>>>> Flavio
>>> 
>> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: HBase cluster design

Posted by Ted Yu <yu...@gmail.com>.
The 'Connection refused' message was logged at WARN level.

If you can pastebin more of the region server log before its crash, I would
take a deeper look.

BTW I assume your zookeeper quorum was healthy during that period of time.


On Fri, Apr 4, 2014 at 7:29 AM, Flavio Pompermaier <po...@okkam.it> wrote:

> Yes I know I should update HBase, this is something I'm going to do really
> soon. Bad me..
> I just wanted to know if the fact of adding/updating rows in HBase while
> running a mapred job could be problematic or not..
> From what you told me it's not, so the problem could be caused by the old
> version of HBase or some other os configuration.
> The update was performed via an application accessing HBase directly,
> adding and updating rows of the table.
> Once in a while some region servers goes down and marked as "bad state" by
> Cloudera so I have to restart them.
>
> The error I usually see is:
>
> 2012-11-23 12:41:00,468 WARN org.apache.zookeeper.ClientCnxn: Session
> 0x13b2cf447fd0000 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1047
>
> Best,
> Flavio
>
> On Fri, Apr 4, 2014 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Was the updating performed by one of the mapreduce jobs ?
> > HBase should be able to serve multiple mapreduce jobs in the same
> > cluster.
> >
> > Can you provide more detail on the crash ?
> >
> > BTW, there are 3 major releases after 0.92
> > Please consider upgrading your cluster to a newer release.
> >
> > Cheers
> >
> > On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it>
> > wrote:
> >
> > > Hi to everybody,
> > >
> > > I have a probably stupid question: is it a problem to run many
> > > mapreduce jobs on the same HBase table at the same time? And multiple
> > > jobs on different tables on the same cluster?
> > > Should I use Hoya to have a better cluster usage..?
> > >
> > > In my current cluster I noticed that the region servers tend to go down
> > > if I run a mapreduce job while updating (maybe it could be related to
> > > the old version of HBase I'm currently running: 0.92.1-cdh4.1.2).
> > >
> > > Best,
> > > Flavio
> >
>

Re: HBase cluster design

Posted by Flavio Pompermaier <po...@okkam.it>.
Yes I know I should update HBase, this is something I'm going to do really
soon. Bad me..
I just wanted to know if the fact of adding/updating rows in HBase while
running a mapred job could be problematic or not..
From what you told me it's not, so the problem could be caused by the old
version of HBase or some other os configuration.
The update was performed via an application accessing HBase directly,
adding and updating rows of the table.
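
Stripped down, its write path is just standard client puts, along the lines
of this sketch (0.92-era API; table, family, qualifier and row key are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UpdaterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        table.setAutoFlush(false);                 // buffer puts client-side instead of one RPC per put
        table.setWriteBufferSize(2 * 1024 * 1024); // ~2 MB write buffer
        Put put = new Put(Bytes.toBytes("some-row-key"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("some value"));
        table.put(put);                            // repeated for every row we add or update
        table.flushCommits();                      // push whatever is still buffered
        table.close();
    }
}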
Once in a while some region servers go down and are marked as "bad state" by
Cloudera, so I have to restart them.

The error I usually see is:

2012-11-23 12:41:00,468 WARN org.apache.zookeeper.ClientCnxn: Session
0x13b2cf447fd0000 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1047

Best,
Flavio

On Fri, Apr 4, 2014 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:

> Was the updating performed by one of the mapreduce jobs ?
> HBase should be able to serve multiple mapreduce jobs in the same cluster.
>
> Can you provide more detail on the crash ?
>
> BTW, there are 3 major releases after 0.92
> Please consider upgrading your cluster to a newer release.
>
> Cheers
>
> On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
> > Hi to everybody,
> >
> > I have a probably stupid question: is it a problem to run many mapreduce
> > jobs on the same HBase table at the same time? And multiple jobs on
> > different tables on the same cluster?
> > Should I use Hoya to have a better cluster usage..?
> >
> > In my current cluster I noticed that the region servers tend to go down
> > if I run a mapreduce job while updating (maybe it could be related to
> > the old version of HBase I'm currently running: 0.92.1-cdh4.1.2).
> >
> > Best,
> > Flavio
>

Re: HBase cluster design

Posted by Ted Yu <yu...@gmail.com>.
Was the updating performed by one of the mapreduce jobs ?
HBase should be able to serve multiple mapreduce jobs in the same cluster. 

Can you provide more detail on the crash ?

BTW, there are 3 major releases after 0.92
Please consider upgrading your cluster to a newer release.

Cheers

On Apr 4, 2014, at 3:08 AM, Flavio Pompermaier <po...@okkam.it> wrote:

> Hi to everybody,
> 
> I have a probably stupid question: is it a problem to run many mapreduce
> jobs on the same HBase table at the same time? And multiple jobs on
> different tables on the same cluster?
> Should I use Hoya to have a better cluster usage..?
> 
> In my current cluster I noticed that the region servers tend to go down if
> I run a mapreduce job while updating (maybe it could be related to the old
> version of HBase I'm currently running: 0.92.1-cdh4.1.2).
> 
> Best,
> Flavio