Posted to mapreduce-user@hadoop.apache.org by Colin Kincaid Williams <di...@uw.edu> on 2014/10/17 02:03:31 UTC

decommissioning disks on a data node

We have been seeing some of the disks on our cluster develop bad blocks and
then fail. We are using some Dell PERC H700 disk controllers that create
"virtual devices".

Our hosting manager uses a Dell utility which reports "virtual device bad
blocks". He has suggested that we use the Dell tool to clear the "virtual
device bad blocks" and then re-format the device.

I'm wondering if we can remove the disks in question from hdfs-site.xml and
restart the datanode, so that the Hadoop blocks on the other disks don't get
re-replicated. Then we would go ahead and work on the troubled disk while the
datanode remained up. Finally, we would restart the datanode again after
re-adding the freshly formatted (possibly new) disk. This way the data on the
remaining disks doesn't get re-replicated.
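
For illustration, the mounts the datanode uses are listed in hdfs-site.xml
under dfs.data.dir (dfs.datanode.data.dir on Hadoop 2.x); dropping the bad
mount might look roughly like this, with made-up paths:

  <property>
    <name>dfs.data.dir</name>
    <!-- /data/3/dfs/dn pulled out while the disk is being worked on -->
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/4/dfs/dn</value>
  </property>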

I don't know too much about the Hadoop block system. Will this work? Is it
an acceptable strategy for disk maintenance?

Re: decommissioning disks on a data node

Posted by Travis <hc...@ghostar.org>.
On Thu, Oct 16, 2014 at 11:41 PM, Colin Kincaid Williams <di...@uw.edu>
wrote:

>  Hi Travis,
>
> Thanks for your input. I forgot to mention that the drives are most likely
> in the single drive configuration that you describe.
>

Then clearing the virtual badblock list is unlikely to do anything useful
if the drive itself already has failed sectors.  One thing you can do is

/opt/srvadmin/bin/omreport storage pdisk controller=$controller_number

and check to see if any of them show as a non-critical state (as opposed to
Ok) or a failed state.  If it's reporting non-critical, this means that OMSA
expects it to fail in the near future.  There's a script written to work as a
Nagios check called check_openmanage.  It has a nice
way of consolidating all of the information that OMSA reports on into a
useful and concise report of what's actually happening with your Dell
hardware.

It's available here:

http://folk.uio.no/trondham/software/check_openmanage.html

These two things will at least give you some indication on whether or not
you'll be replacing the drive in the near future.
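
If it helps, a quick way to sweep every controller at once (a rough sketch;
the controller ids 0 and 1 are placeholders, and the grep is just to trim
the output):

  for c in 0 1; do
    echo "== controller $c =="
    /opt/srvadmin/bin/omreport storage pdisk controller=$c \
      | grep -E '^(ID|Status|State)'
  done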


>
> I think what I've found is that restarting the datanodes in the manner I
> describe shows that the mount points on the drives with the reset blocks
> and newly formatted partition have gone bad. Then I'm not sure the namenode
> will use these locations, even if it does not show the volumes failed.
> Without a way to reinitialize the disks, specifically the mount points, I
> assume my efforts are in vain.
>

So, there are really three ways that bad blocks can get dealt with:

1.  physical drive identifies a bad block via SMART and remaps it to a
reserved block.  There's a limited supply of these.  This happens
automatically within the drive's own firmware controller.
2.  virtual drive identifies a bad block via the RAID controller and remaps
it somewhere, probably to a section of blocks reserved by the controller
for this.  Similar to #1.  In the case of multi-disk RAIDs, a bad block
that lives on disk A could potentially be remapped to a good block on
disk B because of the virtual disk block remapping.
3.  mkfs identifies bad blocks during filesystem creation and prevents data
from being written there.

I'm not sure what the actual recovery behavior is for #2 in the case of
single-disk RAID0.

If #1 occurs, the drive should just be replaced.  If you absolutely can't
replace it, you can try doing #3 (assuming you use ext3/ext4; not sure
about how to do it with other filesystems), but don't be surprised if it
doesn't work or if you begin having mysterious data corruption as more and
more sectors fail on the disk.
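
For what it's worth, #3 on ext3/ext4 usually looks something like the
following (a sketch only; the device and mount point are made up, and
giving -c twice does the slower read-write scan):

  umount /data/3
  badblocks -sv /dev/sdd1            # see how bad it really is first
  mkfs.ext4 -cc -L DFS3 /dev/sdd1    # recreate the fs, skipping bad blocks
  mount LABEL=DFS3 /data/3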



> Therefore the only procedure that makes sense is to decommission the nodes
> on which I want to bring the failed volumes back up. It just didn't make
> sense to me that if we have a large number of disks with good data, that we
> would end up wiping that data and starting over again.
>
>
You don't need to decom the whole thing just to replace a single disk.

There is one case where doing a full decom is useful and that has to do
with the "risk" that the replacement drive can become a hotspot depending
on how you've configured the block placement policy that the datanode uses
to determine how to fill drives up.  In this particular case, if you have
hotspotted drives because the datanode is choosing to place all new blocks
on the new drive until it equalizes in usage compared to other drives in
the system, you could run into performance issues.  In practice, it
probably doesn't matter much.  "Fixing" the problem means doing a full
decommission, then adding the node back in and running a full cluster
rebalance with the Balancer.
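
For reference, the full decom-and-rebalance dance on Hadoop 2.x is roughly
the following (a sketch; the exclude file is whatever dfs.hosts.exclude
points at in your setup):

  # add the node's hostname to the namenode's exclude file, then:
  hdfs dfsadmin -refreshNodes      # the node begins decommissioning
  # wait for it to show Decommissioned, fix the disk, pull the host back
  # out of the exclude file, refresh again, restart its datanode, then:
  hdfs balancer -threshold 10      # even usage out across the cluster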

There's an interesting discussion about intra-datanode block placement at
https://issues.apache.org/jira/browse/HDFS-1312.

In my cluster, we almost never go this route when replacing just one disk
in a system.  We have 12 disks in each node, so replacing 1 means only 9%
of the data *on that node* could potentially run into this.  And with the
size of our cluster that's somewhere below 0.1% of the data that could be
affected.  Just not worth worrying about it.

Anyhow, replace the disk.  You'll be a happier Hadoop user then. :-)

Travis
-- 
Travis Campbell
travis@ghostar.org

Re: decommissioning disks on a data node

Posted by Colin Kincaid Williams <di...@uw.edu>.
 Hi Travis,

Thanks for your input. I forgot to mention that the drives are most likely
in the single drive configuration that you describe.

I think what I've found is that restarting the datanodes in the manner I
describe shows that the mount points on the drives with the reset blocks
and newly formatted partition have gone bad. Then I'm not sure the namenode
will use these locations, even if it does not show the volumes failed.
Without a way to reinitialize the disks, specifically the mount points, I
assume my efforts are in vain.

Therefore the only procedure that makes sense is to decommission the nodes
on which I want to bring the failed volumes back up. It just didn't make
sense to me that if we have a large number of disks with good data, that we
would end up wiping that data and starting over again.

On Oct 16, 2014 8:36 PM, "Travis" <hc...@ghostar.org> wrote:

>
>
> On Thu, Oct 16, 2014 at 10:01 PM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
>> For some reason he seems intent on resetting the bad Virtual blocks, and
>> giving the drives another shot. From what he told me, nothing is under
>> warranty anymore. My first suggestion was to get rid of the disks.
>>
>> Here's the command:
>>
>> /opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
>> controller=1 vdisk=$vid
>>
>
> Well, the usefulness of this action is going to entirely depend on how
> you've actually set up the virtual disks.
>
> If you've set it up so there's only one physical disk in each vdisk
> (single-disk RAID0), then the bad "virtual" block is likely going to map to
> a real bad block.
>
> If you're doing something where there are multiple disks associated with
> each virtual disk (eg, RAID1, RAID10 ... can't remember if RAID5/RAID6 can
> exhibit what follows), it's possible for the virtual device to have a bad
> block that is actually mapped to a good physical block underneath.  This
> can happen, for example, if you had a failing drive in the vdisk and
> replaced it, but the controller had remapped the bad virtual block to some
> place good.  Replacing the drive with a good one makes the controller think
> the bad block is still there.  Dell calls it a punctured stripe (for better
> description see
> http://lists.us.dell.com/pipermail/linux-poweredge/2010-December/043832.html).
> In this case, the fix is clearing the virtual badblock list with the above
> command.
>
>
>> I'm still curious about how hadoop blocks work. I'm assuming that each
>> block is stored on one of the many mountpoints, and not divided between
>> them. I know there is a tolerated volume failure option in hdfs-site.xml.
>>
>
> Correct.  Each HDFS block is actually treated as a file that lives on a
> regular filesystem, like ext3 or ext4.   If you did an ls inside one of
> your vdisks, you'd see the raw blocks that the datanode is actually
> storing.  You just wouldn't be able to easily tell what file that block was
> a part of because it's named with a block id, not the actual file name.
>
>
>> Then if the operations I laid out are legitimate, specifically removing
>> the drive in question and restarting the data node. The advantage being
>> less re-replication and less downtime.
>>
>>
> Yup.  It will minimize the actual prolonged outage of the datanode
> itself.  You'll get a little re-replication while the datanode process is
> off, but if you keep that time reasonably short, you should be fine.  When
> the datanode process comes back up, it will walk all of its configured
> filesystems determining which blocks it still has on disk and report that
> back to the namenode.  Once that happens, re-replication will stop because
> the namenode knows where those missing blocks are and will no longer treat them
> as under-replicated.
>
> Note:  You'll still get some re-replication occurring for the blocks that
> lived on the drive you removed.  But it's only a drive's worth of blocks,
> not a whole datanode.
>
> Travis
> --
> Travis Campbell
> travis@ghostar.org
>

Re: decommissioning disks on a data node

Posted by Travis <hc...@ghostar.org>.
On Thu, Oct 16, 2014 at 10:01 PM, Colin Kincaid Williams <di...@uw.edu>
wrote:

> For some reason he seems intent on resetting the bad Virtual blocks, and
> giving the drives another shot. From what he told me, nothing is under
> warranty anymore. My first suggestion was to get rid of the disks.
>
> Here's the command:
>
> /opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
> controller=1 vdisk=$vid
>

Well, the usefulness of this action is going to entirely depend on how
you've actually set up the virtual disks.

If you've set it up so there's only one physical disk in each vdisk
(single-disk RAID0), then the bad "virtual" block is likely going to map to
a real bad block.

If you're doing something where there are multiple disks associated with
each virtual disk (eg, RAID1, RAID10 ... can't remember if RAID5/RAID6 can
exhibit what follows), it's possible for the virtual device to have a bad
block that is actually mapped to a good physical block underneath.  This
can happen, for example, if you had a failing drive in the vdisk and
replaced it, but the controller had remapped the bad virtual block to some
place good.  Replacing the drive with a good one makes the controller think
the bad block is still there.  Dell calls it a punctured stripe (for better
description see
http://lists.us.dell.com/pipermail/linux-poweredge/2010-December/043832.html).
In this case, the fix is clearing the virtual badblock list with the above
command.


> I'm still curious about how hadoop blocks work. I'm assuming that each
> block is stored on one of the many mountpoints, and not divided between
> them. I know there is a tolerated volume failure option in hdfs-site.xml.
>

Correct.  Each HDFS block is actually treated as a file that lives on a
regular filesystem, like ext3 or ext4.   If you did an ls inside one of
your vdisks, you'd see the raw blocks that the datanode is actually
storing.  You just wouldn't be able to easily tell what file that block was
a part of because it's named with a block id, not the actual file name.
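
Roughly what that looks like on disk, with made-up ids (the exact directory
layout differs between Hadoop 1.x and 2.x, so treat this as a sketch):

  $ ls /data/1/dfs/dn/current/BP-.../current/finalized/subdir0/
  blk_1073741857      blk_1073741857_1033.meta
  blk_1073741858      blk_1073741858_1034.meta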


> Then if the operations I laid out are legitimate, specifically removing
> the drive in question and restarting the data node. The advantage being
> less re-replication and less downtime.
>
>
Yup.  It will minimize the actual prolonged outage of the datanode itself.
You'll get a little re-replication while the datanode process is off, but
if you keep that time reasonably short, you should be fine.  When the
datanode process comes back up, it will walk all of its configured
filesystems determining which blocks it still has on disk and report that
back to the namenode.  Once that happens, re-replication will stop because
the namenode knows where those missing blocks are and will no longer treat them
as under-replicated.

Note:  You'll still get some re-replication occurring for the blocks that
lived on the drive you removed.  But it's only a drive's worth of blocks,
not a whole datanode.
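
A couple of commands make it easy to watch that settle (the grep pattern is
just illustrative):

  hdfs dfsadmin -report                      # confirm the node re-registered
  hdfs fsck / | grep -i 'under-replicated'   # count should drop back to 0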

Travis
-- 
Travis Campbell
travis@ghostar.org

Re: decommissioning disks on a data node

Posted by Colin Kincaid Williams <di...@uw.edu>.
For some reason he seems intent on resetting the bad Virtual blocks, and
giving the drives another shot. From what he told me, nothing is under
warranty anymore. My first suggestion was to get rid of the disks.

Here's the command:

/opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
controller=1 vdisk=$vid

I'm still curious about how hadoop blocks work. I'm assuming that each
block is stored on one of the many mountpoints, and not divided between
them. I know there is a tolerated volume failure option in hdfs-site.xml.
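
For reference, that option is dfs.datanode.failed.volumes.tolerated; by
default (0) any volume failure shuts the datanode down. A sketch of allowing
one failed volume:

  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>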

Then I'd like to confirm that the operations I laid out are legitimate,
specifically removing the drive in question and restarting the data node,
the advantage being less re-replication and less downtime.

On Thu, Oct 16, 2014 at 6:58 PM, Travis <hc...@ghostar.org> wrote:

>
>
> On Thu, Oct 16, 2014 at 7:03 PM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
>> We have been seeing some of the disks on our cluster having bad blocks,
>> and then failing. We are using some dell PERC H700 disk controllers that
>> create "virtual devices".
>>
>>
> Are you doing a bunch of single-disk RAID0 devices with the PERC to mimic
> JBOD?
>
>
>> Our hosting manager uses a dell utility which reports "virtual device bad
>> blocks". He has suggested that we use the dell tool to remove the "virtual
>> device bad blocks", and then re-format the device.
>>
>
> Which Dell tool is he using for this?  the OMSA tools?  In practice, if
> OMSA is telling you the drive is bad, it's likely already exhausted all the
> available reserved blocks that it could use to remap bad blocks and
> probably not worth messing with the drive.  Just get Dell to replace it
> (assuming your hardware is under warranty or support).
>
>
>>
>>  I'm wondering if we can remove the disks in question from the
>> hdfs-site.xml, and restart the datanode , so that we don't re-replicate the
>> hadoop blocks on the other disks. Then we would go ahead and work on the
>> troubled disk, while the datanode remained up. Finally we would restart the
>> datanode again after re-adding the freshly formatted { possibly new } disk.
>> This way the data on the remaining disks doesn't get re-replicated.
>>
>> I don't know too much about the hadoop block system. Will this work ? Is
>> it an acceptable strategy for disk maintenance ?
>>
>
> The data may still re-replicate from the missing disk within your cluster
> if the namenode determines that those blocks are under-replicated.
>
> Unless your cluster is so tight on space that you couldn't handle taking
> one disk out for maintenance, the re-replication of blocks from the missing
> disk within the cluster should be fine.   You don't need to keep the entire
> datanode down throughout the entire time you're running tests on the
> drive.  The process you laid out is basically how we manage disk
> maintenance on our Dells:  stopping the datanode, unmounting the broken
> drive, modifying the hdfs-site.xml for that node, and restarting it.
>
> I've automated some of this process with puppet by taking advantage of
> ext3/ext4's ability to set a label on the partition that puppet looks for
> when configuring mapred-site.xml and hdfs-site.xml.  I talk about it in a
> few blog posts from a few years back if you're interested.
>
>   http://www.ghostar.org/2011/03/hadoop-facter-and-the-puppet-marionette/
>
> http://www.ghostar.org/2013/05/using-cobbler-with-a-fast-file-system-creation-snippet-for-kickstart-post-install/
>
>
> Cheers,
> Travis
> --
> Travis Campbell
> travis@ghostar.org
>
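
Putting the per-disk sequence Travis describes above into one place, it comes
out roughly like this (a sketch; script names and paths depend on your
distribution and packaging):

  # on the affected datanode
  hadoop-daemon.sh stop datanode     # or your init/service wrapper
  umount /data/3                     # the failing mount
  # edit hdfs-site.xml: drop /data/3 from dfs.data.dir /
  # dfs.datanode.data.dir, then bring the datanode back up
  hadoop-daemon.sh start datanode
  # test, reformat, or replace the drive, remount it, re-add it to the
  # property, and restart the datanode once more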

Re: decommissioning disks on a data node

Posted by Colin Kincaid Williams <di...@uw.edu>.
For some reason he seems intent on resetting the bad Virtual blocks, and
giving the drives another shot. From what he told me, nothing is under
warranty anymore. My first suggestion was to get rid of the disks.

Here's the command:

/opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
controller=1 vdisk=$vid

I'm still curious about how hadoop blocks work. I'm assuming that each
block is stored on one of the many mountpoints, and not divided between
them. I know there is a tolerated volume failure option in hdfs-site.xml.

Then if the operations I laid out are legitimate, specifically removing the
drive in question and restarting the data node. The advantage being less
re-replication and less downtime.

On Thu, Oct 16, 2014 at 6:58 PM, Travis <hc...@ghostar.org> wrote:

>
>
> On Thu, Oct 16, 2014 at 7:03 PM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
>> We have been seeing some of the disks on our cluster having bad blocks,
>> and then failing. We are using some dell PERC H700 disk controllers that
>> create "virtual devices".
>>
>>
> Are you doing a bunch of single-disk RAID0 devices with the PERC to mimic
> JBOD?
>
>
>> Our hosting manager uses a dell utility which reports "virtual device bad
>> blocks". He has suggested that we use the dell tool to remove the "virtual
>> device bad blocks", and then re-format the device.
>>
>
> Which Dell tool is he using for this?  the OMSA tools?  In practice, if
> OMSA is telling you the drive is bad, it's likely already exhausted all the
> available reserved blocks that it could use to remap bad blocks and
> probably not worth messing with the drive.  Just get Dell to replace it
> (assuming your hardware is under warranty or support).
>
>
>>
>>  I'm wondering if we can remove the disks in question from the
>> hdfs-site.xml, and restart the datanode , so that we don't re-replicate the
>> hadoop blocks on the other disks. Then we would go ahead and work on the
>> troubled disk, while the datanode remained up. Finally we would restart the
>> datanode again after re-adding the freshly formatted { possibly new } disk.
>> This way the data on the remaining disks doesn't get re-replicated.
>>
>> I don't know too much about the hadoop block system. Will this work ? Is
>> it an acceptable strategy for disk maintenance ?
>>
>
> The data may still re-replicate from the missing disk within your cluster
> if the namenode determines that those blocks are under-replicated.
>
> Unless your cluster is so tight on space that you couldn't handle taking
> one disk out for maintenance, the re-replication of blocks from the missing
> disk within the cluster should be fine.   You don't need to keep the entire
> datanode down through out the entire time you're running tests on the
> drive.  The process you laid out is basically how we manage disk
> maintenance on our Dells:  stopping the datanode, unmounting the broken
> drive, modifying the hdfs-site.xml for that node, and restarting it.
>
> I've automated some of this process with puppet by taking advantage of
> ext3/ext4's ability to set a label on the partition that puppet looks for
> when configuring mapred-site.xml and hdfs-site.xml.  I talk about it in a
> few blog posts from a few years back if you're interested.
>
>   http://www.ghostar.org/2011/03/hadoop-facter-and-the-puppet-marionette/
>
> http://www.ghostar.org/2013/05/using-cobbler-with-a-fast-file-system-creation-snippet-for-kickstart-post-install/
>
>
> Cheers,
> Travis
> --
> Travis Campbell
> travis@ghostar.org
>

Re: decommissioning disks on a data node

Posted by Colin Kincaid Williams <di...@uw.edu>.
For some reason he seems intent on resetting the bad Virtual blocks, and
giving the drives another shot. From what he told me, nothing is under
warranty anymore. My first suggestion was to get rid of the disks.

Here's the command:

/opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
controller=1 vdisk=$vid

I'm still curious about how hadoop blocks work. I'm assuming that each
block is stored on one of the many mountpoints, and not divided between
them. I know there is a tolerated volume failure option in hdfs-site.xml.

Then if the operations I laid out are legitimate, specifically removing the
drive in question and restarting the data node. The advantage being less
re-replication and less downtime.

On Thu, Oct 16, 2014 at 6:58 PM, Travis <hc...@ghostar.org> wrote:

>
>
> On Thu, Oct 16, 2014 at 7:03 PM, Colin Kincaid Williams <di...@uw.edu>
> wrote:
>
>> We have been seeing some of the disks on our cluster having bad blocks,
>> and then failing. We are using some dell PERC H700 disk controllers that
>> create "virtual devices".
>>
>>
> Are you doing a bunch of single-disk RAID0 devices with the PERC to mimic
> JBOD?
>
>
>> Our hosting manager uses a dell utility which reports "virtual device bad
>> blocks". He has suggested that we use the dell tool to remove the "virtual
>> device bad blocks", and then re-format the device.
>>
>
> Which Dell tool is he using for this?  the OMSA tools?  In practice, if
> OMSA is telling you the drive is bad, it's likely already exhausted all the
> available reserved blocks that it could use to remap bad blocks and
> probably not worth messing with the drive.  Just get Dell to replace it
> (assuming your hardware is under warranty or support).
>
>
>>
>>  I'm wondering if we can remove the disks in question from the
>> hdfs-site.xml, and restart the datanode , so that we don't re-replicate the
>> hadoop blocks on the other disks. Then we would go ahead and work on the
>> troubled disk, while the datanode remained up. Finally we would restart the
>> datanode again after re-adding the freshly formatted { possibly new } disk.
>> This way the data on the remaining disks doesn't get re-replicated.
>>
>> I don't know too much about the hadoop block system. Will this work ? Is
>> it an acceptable strategy for disk maintenance ?
>>
>
> The data may still re-replicate from the missing disk within your cluster
> if the namenode determines that those blocks are under-replicated.
>
> Unless your cluster is so tight on space that you couldn't handle taking
> one disk out for maintenance, the re-replication of blocks from the missing
> disk within the cluster should be fine.   You don't need to keep the entire
> datanode down throughout the entire time you're running tests on the
> drive.  The process you laid out is basically how we manage disk
> maintenance on our Dells:  stopping the datanode, unmounting the broken
> drive, modifying the hdfs-site.xml for that node, and restarting it.
>
> I've automated some of this process with puppet by taking advantage of
> ext3/ext4's ability to set a label on the partition that puppet looks for
> when configuring mapred-site.xml and hdfs-site.xml.  I talk about it in a
> few blog posts from a few years back if you're interested.
>
>   http://www.ghostar.org/2011/03/hadoop-facter-and-the-puppet-marionette/
>
> http://www.ghostar.org/2013/05/using-cobbler-with-a-fast-file-system-creation-snippet-for-kickstart-post-install/
>
>
> Cheers,
> Travis
> --
> Travis Campbell
> travis@ghostar.org
>

Re: decommissioning disks on a data node

Posted by Travis <hc...@ghostar.org>.
On Thu, Oct 16, 2014 at 7:03 PM, Colin Kincaid Williams <di...@uw.edu>
wrote:

> We have been seeing some of the disks on our cluster having bad blocks,
> and then failing. We are using some dell PERC H700 disk controllers that
> create "virtual devices".
>
>
Are you doing a bunch of single-disk RAID0 devices with the PERC to mimic
JBOD?


> Our hosting manager uses a dell utility which reports "virtual device bad
> blocks". He has suggested that we use the dell tool to remove the "virtual
> device bad blocks", and then re-format the device.
>

Which Dell tool is he using for this?  the OMSA tools?  In practice, if
OMSA is telling you the drive is bad, it's likely already exhausted all the
available reserved blocks that it could use to remap bad blocks and
probably not worth messing with the drive.  Just get Dell to replace it
(assuming your hardware is under warranty or support).


>
>  I'm wondering if we can remove the disks in question from the
> hdfs-site.xml, and restart the datanode , so that we don't re-replicate the
> hadoop blocks on the other disks. Then we would go ahead and work on the
> troubled disk, while the datanode remained up. Finally we would restart the
> datanode again after re-adding the freshly formatted { possibly new } disk.
> This way the data on the remaining disks doesn't get re-replicated.
>
> I don't know too much about the hadoop block system. Will this work ? Is
> it an acceptable strategy for disk maintenance ?
>

The data may still re-replicate from the missing disk within your cluster
if the namenode determines that those blocks are under-replicated.

Unless your cluster is so tight on space that you couldn't handle taking
one disk out for maintenance, the re-replication of blocks from the missing
disk within the cluster should be fine.   You don't need to keep the entire
datanode down throughout the entire time you're running tests on the
drive.  The process you laid out is basically how we manage disk
maintenance on our Dells:  stopping the datanode, unmounting the broken
drive, modifying the hdfs-site.xml for that node, and restarting it.
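
If you want to keep an eye on the re-replication while the volume is out,
something along these lines works; on older releases the command is hadoop
rather than hdfs:

hdfs dfsadmin -report                     # per-datanode capacity and usage summary
hdfs fsck / | grep -i 'under.replicated'  # under-replicated block count from the fsck summary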

I've automated some of this process with puppet by taking advantage of
ext3/ext4's ability to set a label on the partition that puppet looks for
when configuring mapred-site.xml and hdfs-site.xml.  I talk about it in a
few blog posts from a few years back if you're interested.

  http://www.ghostar.org/2011/03/hadoop-facter-and-the-puppet-marionette/

http://www.ghostar.org/2013/05/using-cobbler-with-a-fast-file-system-creation-snippet-for-kickstart-post-install/
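
In case those posts move, the gist of the label trick is something like this;
the label name, device, and mount point are made up for illustration:

# label the filesystem when it's created, or later with e2label
mkfs.ext4 -L HADOOP-D3 /dev/sdd1
e2label /dev/sdd1 HADOOP-D3

# mount by label in /etc/fstab so device renumbering doesn't matter
LABEL=HADOOP-D3  /data/d3  ext4  defaults,noatime  0 0

# a custom facter fact then lists the mounted HADOOP-* labels, and the puppet
# templates build the hdfs-site.xml and mapred-site.xml directory lists from it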


Cheers,
Travis
-- 
Travis Campbell
travis@ghostar.org
