Posted to common-user@hadoop.apache.org by Marcin Tustin <mt...@handybook.com> on 2015/12/19 17:38:46 UTC

HDFS HA Namenodes crash all the time

Hi All,

We have just switched over to HA namenodes with ZK failover, using
HDP-2.3.0.0-2557
(HDFS 2.7.1.2.3). I'm looking for suggestions as to what to investigate to
make this more stable.

Before we went to HA, our namenode was reasonably stable. Now the namenodes
are crashing multiple times a day and frequently failing to fail over
correctly, to the point where I can't even use haadmin -transitionToActive
to force a failover; instead I have to restart the namenodes.
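
For concreteness, this is roughly what I run when a failover gets stuck
(a sketch; nn1/nn2 are the dfs.ha.namenodes IDs from our hdfs-site.xml, so
your IDs may differ):

    hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -failover nn1 nn2       # graceful failover request (may be
                                         # refused with automatic failover on)
    # a manual transition needs --forcemanual when ZKFC is in charge,
    # and it bypasses the failover controller entirely
    hdfs haadmin -transitionToActive nn2 --forcemanual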

We're running them on AWS instances with 31.01 GB of RAM and 8 cores. In
addition to the namenode, we host a journalnode, a zkfailovercontroller, and
the Ambari metrics collector on the same machine. (The third journalnode
lives with the YARN resource manager.)

Right now the namenodes are configured with a maximum heap of 25 GB.
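
For reference, that heap is set via the standard HADOOP_NAMENODE_OPTS hook
in hadoop-env.sh (Ambari manages the file for us; the flags below are a
sketch rather than our exact settings):

    # hadoop-env.sh
    export HADOOP_NAMENODE_OPTS="-Xms25g -Xmx25g \
      -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
      -XX:+UseCMSInitiatingOccupancyOnly \
      ${HADOOP_NAMENODE_OPTS}"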

Does that sound reasonable? What else should we be paying attention to in
order to make HDFS stable again?

With thanks,
Marcin

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led 
by Fidelity


Re: HDFS HA Namenodes crash all the time

Posted by Nikhil <mn...@gmail.com>.
Check the ZKFC logs first, and check the HDFS HA and ZooKeeper timeouts.
It's also better to have a dedicated disk for the JournalNode service
(similar to ZooKeeper).
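
For example, these are the settings I mean; you can dump what the cluster is
actually using with hdfs getconf (the property names are the stock Hadoop
ones, and the journal path is only an example):

    # ZKFC's ZooKeeper session timeout (core-site.xml)
    hdfs getconf -confKey ha.zookeeper.session-timeout.ms
    # how long the NameNode waits on JournalNode writes (hdfs-site.xml)
    hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms
    # JournalNode edits directory -- ideally on its own disk,
    # e.g. /hadoop/journal/hdfs/journal
    hdfs getconf -confKey dfs.journalnode.edits.dir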

On Sat, Dec 19, 2015 at 9:29 AM, Sandeep Nemuri <nh...@gmail.com>
wrote:

> What do the logs say?
>
> --
> *  Regards*
> *  Sandeep Nemuri*
>


Re: HDFS HA Namenodes crash all the time

Posted by Sandeep Nemuri <nh...@gmail.com>.
What do the logs say?
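
In particular, the NameNode and ZKFC logs around the time of a crash or
failed failover. On an HDP install they usually live somewhere like this
(paths and file names may differ on your hosts):

    less /var/log/hadoop/hdfs/hadoop-hdfs-namenode-$(hostname).log
    less /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-$(hostname).log
    less /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-$(hostname).log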



-- 
*  Regards*
*  Sandeep Nemuri*
