Posted to common-user@hadoop.apache.org by Robin Verlangen <ro...@us2.nl> on 2012/08/17 11:54:19 UTC

HDFS disable balancing cluster

Hi there,

We currently run an eight node cluster on Amazon EC2. This is perfect for
our storage, but we want to add a couple of nodes (let's say 32) for
processing a big task. We spin them up, run the jobs, and terminate the
machines.

Sounds OK to me, however I'm aware of the fact that hadoop tries to
replicate data blocks to other nodes in favor of balancing the cluster. I
don't want this, as I will get under-replicated blocks when terminating the
machines.

We use juju for easy cluster administration. This implies that adding a new
hadoop-slave runs both hdfs and hadoop (mapred).

My main question is, is it possible to disable balancing the cluster, or
just to disable the datanode service on the new nodes (meant for processing
only)?


Best regards,

Robin Verlangen
*Software engineer*
W http://www.robinverlangen.nl
E robin@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.

Re: HDFS disable balancing cluster

Posted by Steve Loughran <st...@hortonworks.com>.
You can bring up new worker nodes without the datanode service, just the
task tracker.

These become compute-only nodes that load and save all their data on the
(mostly) persistent core set.
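
As a minimal sketch of the compute-only idea, assuming a Hadoop 1.x layout where worker daemons are started with `hadoop-daemon.sh start <daemon>`: the helpers below (their names are made up for illustration) only build the start commands a node of each kind would run. Nothing is executed, and how this hooks into a juju charm is not covered here.

```python
from typing import List


def daemons_for(compute_only: bool) -> List[str]:
    """Return the Hadoop 1.x worker daemons a node should run."""
    # A compute-only node skips the datanode, so it never holds HDFS blocks.
    return ["tasktracker"] if compute_only else ["datanode", "tasktracker"]


def start_commands(compute_only: bool,
                   script: str = "hadoop-daemon.sh") -> List[List[str]]:
    """Build (but do not run) one start command per daemon."""
    return [[script, "start", daemon] for daemon in daemons_for(compute_only)]


if __name__ == "__main__":
    # Print what a transient, compute-only node would start.
    for cmd in start_commands(compute_only=True):
        print(" ".join(cmd))
```

Because no datanode ever registers from these machines, terminating them should not leave under-replicated blocks behind.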

Your core machines will take a lot more network traffic if you do this; 8
persistent and 32 transient nodes seems a bit unbalanced. That's something to
experiment with.
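
To put a rough number on that imbalance, here is a back-of-the-envelope sketch. It assumes reads are spread evenly over the storage nodes and ignores writes, replication traffic, and any tasks running on the core nodes themselves; the per-node read rate is a made-up parameter, not a measurement.

```python
def reads_per_storage_node(storage_nodes: int, compute_nodes: int) -> float:
    """Average number of compute-only nodes each storage node has to feed."""
    if storage_nodes <= 0:
        raise ValueError("need at least one storage node")
    return compute_nodes / storage_nodes


def per_storage_node_mb_s(storage_nodes: int, compute_nodes: int,
                          mb_s_per_compute_node: float) -> float:
    """Outbound MB/s each storage node must sustain if reads spread evenly."""
    fan_out = reads_per_storage_node(storage_nodes, compute_nodes)
    return fan_out * mb_s_per_compute_node


if __name__ == "__main__":
    # 8 persistent and 32 transient nodes, as in the thread.
    print(reads_per_storage_node(8, 32))
    # Hypothetical 10 MB/s of input per compute node.
    print(per_storage_node_mb_s(8, 32, 10.0))
```

With the numbers from the thread, each core machine feeds four compute-only nodes on average, so whatever a single task node reads is multiplied by four on the storage side.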

The other thing you can play with for compute-only nodes is the EC2 spot
market: you can bid for cluster time and get hours of it only when the spot
market matches your bid. After the first hour you continue to get that
rate, but the VM can be terminated without notice. If you bring up some or
all of your task-tracker-only VMs on this spot market, you may cut costs.

-steve


RE: HDFS disable balancing cluster

Posted by Leo Leung <ll...@ddn.com>.
In 1.x, the exclude* configuration lists let you fine-tune which nodes do processing, storage, or both (process vs. storage nodes).

This works for dynamically sizing your process nodes.
It does not work well for dynamically sizing your storage nodes, as you may already have discovered.
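
The reply does not name the settings, so the following is an assumption: in Hadoop 1.x the exclude lists are usually plain text files with one hostname per line, referenced by the `dfs.hosts.exclude` (HDFS) and `mapred.hosts.exclude` (MapReduce) properties. A minimal sketch of generating the HDFS-side list for the transient nodes, with made-up hostnames and a hypothetical helper name:

```python
from typing import Iterable


def render_exclude_file(hosts: Iterable[str]) -> str:
    """Render an exclude list: one hostname per line, sorted, no duplicates."""
    # Drop blank entries and surrounding whitespace before de-duplicating.
    cleaned = sorted({h.strip() for h in hosts if h.strip()})
    return "".join(host + "\n" for host in cleaned)


if __name__ == "__main__":
    # Hypothetical hostnames for the transient, compute-only nodes.
    transient = ["compute-02.example.internal", "compute-01.example.internal"]
    print(render_exclude_file(transient), end="")
```

After the file changes, the namenode has to re-read it; in 1.x that is normally done with `hadoop dfsadmin -refreshNodes`.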

Cheers

P.S. Check your EC2 bill. You're going to be reading a lot of data across the network with your model.





