Posted to dev@accumulo.apache.org by dl...@comcast.net on 2015/09/23 14:27:31 UTC

[ADVISORY] Possible data loss during HDFS decommissioning

BLUF: Data loss is possible when performing DataNode decommissioning while Accumulo is running. This note applies to installations of Accumulo 1.5.0+ and Hadoop 2.5.0+.

DETAILS: During DataNode decommissioning it is possible for the NameNode to report stale block locations (HDFS-8208). If Accumulo is running during this process, files that are currently being written may not close properly. Accumulo is affected in two ways:

1. During compactions, temporary rfiles are created, closed, and then renamed. If a failure happens during the close, the compaction will fail. 
2. Write ahead log files are created, written to, and then closed. If a failure happens during the close, the NameNode will have a walog file with no finalized blocks. 

If either of these cases happens, decommissioning of the DataNode could hang (HDFS-3599, HDFS-5579) because the files are left in an open-for-write state. If Accumulo needs the write ahead log for recovery, it will be unable to read the file and will not recover. 
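
For diagnosis, the NameNode can be asked directly whether a suspect walog file was ever finalized, and lease recovery can be attempted if it was not. The following is a minimal sketch using the stock HDFS client API (DistributedFileSystem.isFileClosed and recoverLease); it is not part of Accumulo, and the walog path is a placeholder for a file under your instance's walogs directory.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class WalCloseCheck {
      public static void main(String[] args) throws Exception {
        // Placeholder: pass a walog file path from your instance's walogs directory.
        Path wal = new Path(args[0]);

        // Assumes fs.defaultFS points at the HDFS instance in question.
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());

        // Ask the NameNode whether the file was ever finalized (closed).
        if (dfs.isFileClosed(wal)) {
          System.out.println(wal + " is closed; nothing to do");
          return;
        }

        // Still open for write: ask the NameNode to reclaim the lease and
        // finalize the last block. recoverLease() may need to be retried
        // until it returns true.
        boolean done = dfs.recoverLease(wal);
        System.out.println(wal + " was open for write; lease recovery "
            + (done ? "completed" : "started, retry until it returns true"));
      }
    }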

RECOMMENDATION: Assuming that the replication pipeline for the write ahead log is working properly, you should not run into this issue if you decommission only one rack at a time. 
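
One way to hold to the one-rack-at-a-time rule is to wait until every DataNode on the current rack has finished decommissioning before adding the next rack's hosts to dfs.hosts.exclude and running hdfs dfsadmin -refreshNodes again. A minimal sketch of that wait, using the standard DataNode report from the HDFS client API, is below; the rack name is an assumed example and should be whatever topology label your cluster uses.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class WaitForRackDecommission {
      public static void main(String[] args) throws Exception {
        // Assumed example rack name; use the topology label of the rack whose
        // hosts you added to dfs.hosts.exclude.
        String rack = args.length > 0 ? args[0] : "/rack1";

        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());

        boolean inProgress = true;
        while (inProgress) {
          inProgress = false;
          for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            if (rack.equals(dn.getNetworkLocation()) && dn.isDecommissionInProgress()) {
              inProgress = true;
              System.out.println("still decommissioning: " + dn.getHostName());
            }
          }
          if (inProgress) {
            Thread.sleep(60 * 1000L); // poll once a minute
          }
        }
        System.out.println("rack " + rack + " is done; safe to move to the next rack");
      }
    }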

Re: [ADVISORY] Possible data loss during HDFS decommissioning

Posted by Josh Elser <jo...@gmail.com>.
-cc user@ (figure I'm forking this into a more dev-focused question now)

True, we don't have procedures for retroactively changing docs. I guess 
JIRA essentially acts as our version-affected discovery mechanism. 
People generally seem to understand searching JIRA to find known 
issues too.

My only worry about creating a page on the website is that it's yet 
another place people have to search to get the details on some 
operational subject. We've been doing well (since 1.6) at capturing 
details like this in the user manual, so I figured this would also make 
sense to mention there. Perhaps multiple places is reasonable too?

dlmarion@comcast.net wrote:
> Known issue in the release notes on the web page? We would have to
> update every version though. Seems like we need a known issues document
> that lists issues in dependencies that transcend Accumulo versions.
>
> ------------------------------------------------------------------------
> *From: *"Josh Elser" <jo...@gmail.com>
> *To: *dev@accumulo.apache.org
> *Cc: *user@accumulo.apache.org
> *Sent: *Wednesday, September 23, 2015 10:26:50 AM
> *Subject: *Re: [ADVISORY] Possible data loss during HDFS decommissioning
>
> What kind of documentation can we put in the user manual about this?
> Recommend to only decom one rack at a time until we get the issue sorted
> out in Hadoop-land?
>
> dlmarion@comcast.net wrote:
>  > BLUF: There exists the possibility of data loss when performing
> DataNode decommissioning with Accumulo running. This note applies to
> installations of Accumulo 1.5.0+ and Hadoop 2.5.0+.
>  >
>  > DETAILS: During DataNode decommissioning it is possible for the
> NameNode to report stale block locations (HDFS-8208). If Accumulo is
> running during this process then it is possible that files currently
> being written will not close properly. Accumulo is affected in two ways:
>  >
>  > 1. During compactions temporary rfiles are created, then closed, and
> renamed. If a failure happens during the close, the compaction will fail.
>  > 2. Write ahead log files are created, written to, and then closed. If
> a failure happens during the close, then the NameNode will have a walog
> file with no finalized blocks.
>  >
>  > If either of these cases happen, decommissioning of the DataNode
> could hang (HDFS-3599, HDFS-5579) because the files are left in an open
> for write state. If Accumulo needs the write ahead log for recovery it
> will be unable to read the file and will not recover.
>  >
>  > RECOMMENDATION: Assuming that the replication pipeline for the write
> ahead log is working properly, then you should not run into this issue
> if you only decommission one rack at a time.
>  >
>

Re: [ADVISORY] Possible data loss during HDFS decommissioning

Posted by dl...@comcast.net.
Known issue in the release notes on the web page? We would have to update every version though. Seems like we need a known issues document that lists issues in dependencies that transcend Accumulo versions. 

----- Original Message -----

From: "Josh Elser" <jo...@gmail.com> 
To: dev@accumulo.apache.org 
Cc: user@accumulo.apache.org 
Sent: Wednesday, September 23, 2015 10:26:50 AM 
Subject: Re: [ADVISORY] Possible data loss during HDFS decommissioning 

What kind of documentation can we put in the user manual about this? 
Recommend to only decom one rack at a time until we get the issue sorted 
out in Hadoop-land? 

dlmarion@comcast.net wrote: 
> BLUF: There exists the possibility of data loss when performing DataNode decommissioning with Accumulo running. This note applies to installations of Accumulo 1.5.0+ and Hadoop 2.5.0+. 
> 
> DETAILS: During DataNode decommissioning it is possible for the NameNode to report stale block locations (HDFS-8208). If Accumulo is running during this process then it is possible that files currently being written will not close properly. Accumulo is affected in two ways: 
> 
> 1. During compactions temporary rfiles are created, then closed, and renamed. If a failure happens during the close, the compaction will fail. 
> 2. Write ahead log files are created, written to, and then closed. If a failure happens during the close, then the NameNode will have a walog file with no finalized blocks. 
> 
> If either of these cases happen, decommissioning of the DataNode could hang (HDFS-3599, HDFS-5579) because the files are left in an open for write state. If Accumulo needs the write ahead log for recovery it will be unable to read the file and will not recover. 
> 
> RECOMMENDATION: Assuming that the replication pipeline for the write ahead log is working properly, then you should not run into this issue if you only decommission one rack at a time. 
> 


Re: [ADVISORY] Possible data loss during HDFS decommissioning

Posted by Josh Elser <jo...@gmail.com>.
What kind of documentation can we put in the user manual about this? 
Recommend to only decom one rack at a time until we get the issue sorted 
out in Hadoop-land?

dlmarion@comcast.net wrote:
> BLUF: There exists the possibility of data loss when performing DataNode decommissioning with Accumulo running. This note applies to installations of Accumulo 1.5.0+ and Hadoop 2.5.0+.
>
> DETAILS: During DataNode decommissioning it is possible for the NameNode to report stale block locations (HDFS-8208). If Accumulo is running during this process then it is possible that files currently being written will not close properly. Accumulo is affected in two ways:
>
> 1. During compactions temporary rfiles are created, then closed, and renamed. If a failure happens during the close, the compaction will fail.
> 2. Write ahead log files are created, written to, and then closed. If a failure happens during the close, then the NameNode will have a walog file with no finalized blocks.
>
> If either of these cases happen, decommissioning of the DataNode could hang (HDFS-3599, HDFS-5579) because the files are left in an open for write state. If Accumulo needs the write ahead log for recovery it will be unable to read the file and will not recover.
>
> RECOMMENDATION: Assuming that the replication pipeline for the write ahead log is working properly, then you should not run into this issue if you only decommission one rack at a time.
>
