Posted to hdfs-dev@hadoop.apache.org by Mickey <hu...@gmail.com> on 2013/08/29 10:18:04 UTC

Long time fail over when using QJM

Hi all,
I have been testing QJM HA and it has always worked well. But yesterday I hit
a very long failover with QJM. The test is based on CDH4.3.0.
The logs of the standby namenode and the journalnode are attached.
The network cable on the active namenode (also a datanode) was pulled out at
about 07:24. In the standby namenode log I found entries like this:
2013-08-28 07:24:51,122 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 1Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 41 42
2013-08-28 07:36:14,028 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 32 Total time for transactions(ms): 3Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 9 49 46

The entries themselves look normal. The problem is that nothing is logged
between these two lines, a gap of more than 11 minutes. No long GC pause
occurred. It seems the code was blocked somewhere. Unfortunately, I forgot to
capture a jstack dump T_T.
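
A gap like this can be confirmed by scanning the log for consecutive entries
whose timestamps are far apart. The following is a minimal Python sketch, not
part of the original report. The names `parse_ts` and `find_gaps` and the
60-second threshold are illustrative assumptions, and it assumes every entry
starts with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp as in the entries above.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix used in the log entries quoted above,
# e.g. "2013-08-28 07:24:51,122".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the datetime at the start of a log line, or None."""
    m = TS_RE.match(line)
    if m is None:
        return None  # continuation line (wrapped message, stack trace, ...)
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Yield (previous, current, gap) for consecutive timestamped
    lines that are more than `threshold` apart."""
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None and ts - prev > threshold:
            yield prev, ts, ts - prev
        prev = ts


if __name__ == "__main__":
    # The two timestamps from this report; message bodies are abbreviated.
    sample = [
        "2013-08-28 07:24:51,122 INFO FSEditLog: ...",
        "2013-08-28 07:36:14,028 INFO FSEditLog: ...",
    ]
    for prev, cur, gap in find_gaps(sample):
        print(f"{prev} -> {cur}: {gap}")
```

For the two timestamps in this report the gap is 11 minutes 22.906 seconds.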

I hope to hear from you.

Best regards,
Mickey

Re: Long time fail over when using QJM

Posted by Mickey <hu...@gmail.com>.
Sorry for the empty mail.

Thanks, Todd.
During the test, my HBase was unavailable for a long time. Maybe something is
wrong with my HBase setup. I will run more tests.

Thanks,
Mickey


2013/8/30 Mickey <hu...@gmail.com>

>
>
>
> 2013/8/30 Todd Lipcon <to...@cloudera.com>
>
>> If you're seeing those log messages, the SBN was already active at that
>> time. It only logs that message when successfully writing transactions.
>> So,
>> the failover must have already completed before the logs you're looking
>> at.
>>
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

Re: Long time fail over when using QJM

Posted by Mickey <hu...@gmail.com>.
2013/8/30 Todd Lipcon <to...@cloudera.com>

> If you're seeing those log messages, the SBN was already active at that
> time. It only logs that message when successfully writing transactions. So,
> the failover must have already completed before the logs you're looking at.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Long time fail over when using QJM

Posted by Todd Lipcon <to...@cloudera.com>.
If you're seeing those log messages, the SBN was already active at that
time. It only logs that message when successfully writing transactions. So,
the failover must have already completed before the logs you're looking at.

-Todd
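
Following this reasoning, the first `FSEditLog` statistics entry in the
standby's log gives a latest-possible time by which the failover had
completed. The sketch below is only an illustration in Python, not a Hadoop
tool. The name `first_edit_log_write` is hypothetical, the marker string is
taken from the entries quoted in this thread, and it assumes each entry sits
on one line that starts with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp.

```python
from datetime import datetime

# Text of the statistics message quoted in this thread.
MARKER = "FSEditLog: Number of transactions"


def first_edit_log_write(lines):
    """Return the timestamp of the first FSEditLog statistics entry,
    or None if no such entry is found."""
    for line in lines:
        if MARKER not in line:
            continue
        try:
            # The timestamp prefix "yyyy-MM-dd HH:mm:ss,SSS" is 23 characters.
            return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # marker found on a line without a leading timestamp
    return None
```

Calling it on the lines of the standby namenode log, for example
`first_edit_log_write(open(path))` with `path` pointing at that log, returns
the time of the first such entry. Anything that delayed the failover itself
would have to appear in the log before that time.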


-- 
Todd Lipcon
Software Engineer, Cloudera