Posted to user@hadoop.apache.org by sam liu <sa...@gmail.com> on 2015/01/24 14:31:17 UTC

Questions on rollback/upgrade HDFS with QJM HA enabled

Hi Experts,

I have questions on rollback/upgrade HDFS with QJM HA enabled.

On the website
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#HDFS_UpgradeFinalizationRollback_with_HA_Enabled,
it says:
'To perform a rollback of an upgrade, both NNs should first be shut down.
The operator should run the roll back command on the NN where they
initiated the upgrade procedure, which will perform the rollback on the
local dirs there, as well as on the shared log, either NFS or on the JNs.
Afterward, this NN should be started and the operator should run
`-bootstrapStandby' on the other NN to bring the two NNs in sync with this
rolled-back file system state.'

Currently I expect the steps to be (please correct me if I am wrong):
NN1 -> hadoop namenode -rollback
NN1 -> hadoop namenode // In our env, this rolled-back NameNode shuts down
right after it finishes -rollback, so it needs to be started again.
NN2 -> hadoop namenode -bootstrapStandby
hadoop datanode -rollback // on all datanodes
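To make the sequence I have in mind concrete, here is a rough shell sketch of it. This is my assumption of the ordering, not a verified procedure; in particular it assumes the JournalNodes must be running before the `-rollback` step (which is exactly what Question 1 below asks about), and the daemon-script paths are the ones from our environment:

```shell
#!/bin/sh
# Hypothetical sketch of the HA rollback ordering (assumed, not confirmed).
# Precondition: the whole cluster has been stopped and the old binaries restored.

# Presumably the JNs must be up so the shared edit log can be rolled back too.
sudo -u hdfs "$HADOOP_HOME/sbin/hadoop-daemon.sh" --config "$HADOOP_CONF_DIR" start journalnode

# On NN1 (where the upgrade was initiated): roll back local dirs and the shared log.
sudo -u hdfs "$HADOOP_HOME/bin/hadoop" namenode -rollback

# In our env the NN process exits after -rollback, so start it again.
sudo -u hdfs "$HADOOP_HOME/sbin/hadoop-daemon.sh" --config "$HADOOP_CONF_DIR" start namenode

# On NN2: re-sync with the rolled-back filesystem state.
sudo -u hdfs "$HADOOP_HOME/bin/hadoop" namenode -bootstrapStandby

# On every DataNode:
sudo -u hdfs "$HADOOP_HOME/bin/hadoop" datanode -rollback
```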

[Question 1]:
One thing I don't know is when the JournalNodes should be started and/or
stopped. It seems they need to be running for 'hadoop namenode -rollback'
to reach the shared log. Should they be restarted at some point during the
procedure?

[Question 2]:
Another issue actually happens after the upgrade and before the rollback
starts: the standby NN process heavily occupies the CPU and somehow eats up
disk space, without that space showing up as used by any visible files. This
was causing "No space left on device" errors during the rollback process. As
soon as I killed the NameNode process, the disk space immediately came back
to a reasonable amount.
What might cause the NN process to hold so much disk space in this hidden
way?
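One way to check a hypothesis of mine (an assumption, not a confirmed cause): the standby NN might be holding files that were deleted while still open, e.g. old edit log segments, whose blocks are only freed when the process closes them or exits, which would match the space reappearing after the kill. Assuming lsof is installed, something like this could confirm it:

```shell
# Find the NameNode PID (the process-match pattern is a guess for our env)
# and list any files it still holds open after deletion; such files keep
# consuming disk space until the process releases them.
NN_PID=$(pgrep -f 'namenode' | head -n 1)
sudo lsof -p "$NN_PID" 2>/dev/null | grep -i deleted
```

If that prints large deleted-but-open files, the "hidden" usage would be explained; if it prints nothing, the cause is something else.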

Thanks!

Re: Questions on rollback/upgrade HDFS with QJM HA enabled

Posted by sam liu <sa...@gmail.com>.
For HDFS rollback with QJM HA enabled, I tried the following steps, but they failed:

0. Stop the whole Hadoop cluster
1. Update env parameters to use old Hadoop binaries
2. Start JNs:
sudo -u hdfs $HADOOP_HOME/sbin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR"
start journalnode
3. Start the active NN with the '-rollback' flag:
sudo -u hdfs $HADOOP_HOME/bin/hadoop namenode -rollback

Note:
- This step passed; however, the active NN then stopped automatically. The log message was:
15/01/25 21:57:48 INFO namenode.FSImage: Rolling back storage directory
/hadoop/hdfs/name.
   new LV = -56; new CTime = 0
15/01/25 21:57:48 INFO namenode.NNUpgradeUtil: Rollback of
/hadoop/hdfs/name is complete.
15/01/25 21:57:48 INFO util.ExitUtil: Exiting with status 0
15/01/25 21:57:48 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bdvs1194.vmware.com/9.30.249.194
************************************************************/
4. Start the active NN again:
sudo -u hdfs $HADOOP_HOME/sbin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR"
start namenode
Note:
The NN failed to start again; the error message was:
2015-01-25 21:59:06,745 ERROR
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception
initializing
http://hostname:8480/getJournal?jid=BICluster&segmentTxId=4924&storageInfo=-56%3A304881993%3A0%3ACID-652befc6-4b10-4b89-a6bc-a411af6ca4c8
org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
Fetch of
http://hostname:8480/getJournal?jid=BICluster&segmentTxId=4924&storageInfo=-56%3A304881993%3A0%3ACID-652befc6-4b10-4b89-a6bc-a411af6ca4c8
failed with status code 403
Response message:
This node has namespaceId '0 and clusterId '' but the requesting node
expected '304881993' and 'CID-652befc6-4b10-4b89-a6bc-a411af6ca4c8'
        at
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:472)
        at
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:460)
        at
java.security.AccessController.doPrivileged(AccessController.java:369)



Re: Questions on rollback/upgrade HDFS with QJM HA enabled

Posted by sam liu <sa...@gmail.com>.
Could any expert please help answer the questions?

Thanks in advance!
