Posted to hdfs-user@hadoop.apache.org by Zesheng Wu <wu...@gmail.com> on 2014/09/09 12:15:30 UTC

HDFS: Couldn't obtain the locations of the last block

Hi,

These days we encountered a critical bug in HDFS that can prevent HBase
from starting normally.
The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written
successfully
2. rs1 successfully asks the NameNode to allocate the second block; at
this moment nn1 (the active NN) crashes due to a journal write timeout
3. nn2 (the standby NN) does not become active because zkfc2 is in an
abnormal state
4. nn1 is restarted and becomes active
5. While nn1 is restarting, rs1 crashes because it writes to nn1 while
nn1 is still in safe mode
6. As a result, the file f1 is left in an abnormal state and the HBase
cluster can no longer serve

We can use the command-line shell to list the file, which looks like the
following:

-rw-------   3 hbase_srv supergroup  134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx

But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.
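
For reference, the same symptom also shows up through the FileSystem API. Below is a minimal
diagnostic sketch (the class name is our own and the path is passed as an argument, so nothing
here is specific to our cluster) that asks the NameNode for the block locations of the file and
prints how many DataNodes host each block; for the broken file, the last block should show zero
hosts:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckLastBlock {
      public static void main(String[] args) throws Exception {
        // Hypothetical invocation: hadoop CheckLastBlock /hbase/lgsrv-push/xxx
        Path file = new Path(args[0]);
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
          System.out.println("block " + i
              + " offset=" + blocks[i].getOffset()
              + " length=" + blocks[i].getLength()
              + " hosts=" + blocks[i].getHosts().length);
        }
      }
    }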

Can anyone help with this?

-- 
Best Wishes!

Yours, Zesheng

RE: HDFS: Couldn't obtain the locations of the last block

Posted by "Liu, Yi A" <yi...@intel.com>.
That’s great.

Regards,
Yi Liu

Re: HDFS: Couldn't obtain the locations of the last block

Posted by Zesheng Wu <wu...@gmail.com>.
Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very
much!

-- 
Best Wishes!

Yours, Zesheng

Re: HDFS: Couldn't obtain the locations of the last block

Posted by Zesheng Wu <wu...@gmail.com>.
Thanks Yi, I will look into HDFS-4516.


-- 
Best Wishes!

Yours, Zesheng

RE: HDFS: Couldn't obtain the locations of the last block

Posted by "Liu, Yi A" <yi...@intel.com>.
Hi Zesheng,

I learned from an offline email of yours that your Hadoop version is 2.0.0-alpha, and you also said “The block is allocated successfully in the NN, but isn’t created in the DN”.
Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions.

From your description, the second block was allocated successfully: the NN flushed the edit log entry to the shared journal and the shared storage may have persisted it, but the RPC back to the NN timed out before success was reported. So the block exists in the shared edit log, but no DN ever creates it. On restart, the client can fail, because in that Hadoop version the client only retries when the NN reports the last block size as non-zero, i.e. when it was synced (see HDFS-4516 for more details).
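
As a side note, the “Will retry for 3 times” messages you posted come from the client's get-last-block-length retry loop (by default 3 attempts, 4 seconds apart). In later 2.x client code this can be tuned through configuration; below is a minimal sketch, assuming the dfs.client.retry.*.get-last-block-length keys from DFSConfigKeys are honored by your release (please verify) and using a placeholder path argument. Raising the retries only helps when DataNodes are merely slow to report; it cannot help in your case, where the last block was never written to any DataNode, so upgrading is the real fix:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OpenWithMoreRetries {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed property names (defaults: 3 retries, 4000 ms apart); they control how long
        // the client keeps asking DataNodes for the length of the last, under-construction block.
        conf.setInt("dfs.client.retry.times.get-last-block-length", 10);
        conf.setInt("dfs.client.retry.interval-ms.get-last-block-length", 10000);
        FileSystem fs = FileSystem.get(conf);
        // Opening the file triggers the last-block-length lookup and, with it, the retry loop.
        FSDataInputStream in = fs.open(new Path(args[0]));
        in.close();
      }
    }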

Regards,
Yi Liu

