Posted to user@hadoop.apache.org by Niels Basjes <Ni...@basjes.nl> on 2013/07/15 23:49:16 UTC

Running a single cluster in multiple datacenters

Hi,

Last week we had a discussion at work regarding setting up our new Hadoop
cluster(s).
One of the things that has changed is that the importance of the Hadoop
stack is growing so we want to be "more available".

One of the points we talked about was setting up the cluster in such a way
that the nodes are physically located in two separate datacenters (on
opposite sides of the same city) with a big network connection in between.
We're currently talking about a cluster in the 50 nodes range, but that
will grow over time.

The advantages I see:
- More CPU power available for jobs.
- The data is automatically copied between the datacenters as long as we
configure them to be different 'racks'.

The disadvantages I see:
- If the network link goes down then one half is dead and the other half will
most likely go into safe mode, because re-replicating the missing replicas
will fill up the disks fast.

What things should we consider also?
Has anyone any experience with such a setup?
Is it a good idea to do this?
What are better options for us to consider?

Thanks for any input.
-- 
Best regards,

Niels Basjes

Re: Running a single cluster in multiple datacenters

Posted by Azuryy Yu <az...@gmail.com>.
Hi Bertrand,
I guess you configured two racks in total: one IDC is one rack and the other IDC is the other rack.
So if you want to avoid re-replication while one IDC is down, you have to change the replica placement policy:
if one rack still holds a minimum number of blocks, don't do anything. (Here the minimum should be '2', which guarantees you have at least two replicas in each IDC.)
So you would have to set the replication factor to '4' if you adopt my advice.
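The rule above can be sketched as a small standalone check (a sketch only; the rack names are invented and this is not Hadoop's actual BlockPlacementPolicy API): with replication factor 4 and one "rack" per IDC, a block is left alone as long as some single rack still holds at least two live replicas.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-per-IDC placement rule, outside Hadoop: with replication
// factor 4 and one "rack" per datacenter, leave a block alone as long as some
// rack still holds at least two live replicas. Rack names are hypothetical.
public class PlacementCheck {
    static final int MIN_REPLICAS_PER_RACK = 2;

    /** True if the block should be re-replicated under the two-per-IDC rule. */
    public static boolean needsReplication(List<String> liveReplicaRacks) {
        // Count live replicas per rack.
        Map<String, Integer> perRack = new HashMap<>();
        for (String rack : liveReplicaRacks) {
            perRack.merge(rack, 1, Integer::sum);
        }
        // If any surviving rack keeps the minimum, don't trigger a
        // replication storm while the other IDC is down.
        for (int count : perRack.values()) {
            if (count >= MIN_REPLICAS_PER_RACK) {
                return false;
            }
        }
        return true;
    }
}
```

With this rule, losing a whole IDC (two of four replicas) triggers nothing, while losing one replica in each IDC does.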



On Jul 16, 2013, at 6:37 AM, Bertrand Dechoux <de...@gmail.com> wrote:

> According to your own analysis, you wouldn't be more available, even though that was your aim.
> Did you consider having two separate clusters? One per datacenter, with an automatic copy of the data?
> I understand that load balancing of work and data would not be easy but it seems to me a simple strategy (that I have seen working).
> 
> However, you are stating that the two datacenters are close and linked by a big network connection.
> What is the impact on the latency and the bandwidth? (between two nodes in the same datacenter versus two nodes in different datacenters)
> The main question is what happens when a job uses TaskTrackers from datacenter A but DataNodes from datacenter B.
> It will happen. Simply consider Reducer tasks that don't have any strategy about locality because it doesn't really make sense in a general context.
> 
> Regards
> 
> Bertrand
> 
> 
> On Mon, Jul 15, 2013 at 11:56 PM, <jb...@nanthrax.net> wrote:
> Hi Niels,
> 
> it depends on the number of replicas and the Hadoop rack configuration (level).
> It's possible to have replicas on the two datacenters.
> 
> What rack configuration do you plan? You can implement your own one and define it using the topology.node.switch.mapping.impl property.
> 
> Regards
> JB
> 
> 
> On 2013-07-15 23:49, Niels Basjes wrote:
> Hi,
> 
> Last week we had a discussion at work regarding setting up our new
> Hadoop cluster(s).
> One of the things that has changed is that the importance of the
> Hadoop stack is growing so we want to be "more available".
> 
> One of the points we talked about was setting up the cluster in such a
> way that the nodes are physically located in two separate datacenters
> (on opposite sides of the same city) with a big network connection in
> between.
> 
> We're currently talking about a cluster in the 50 node range, but that
> will grow over time.
> 
> The advantages I see:
> - More CPU power available for jobs.
> - The data is automatically copied between the datacenters as long as
> we configure them to be different racks.
> 
> 
> The disadvantages I see:
> - If the network goes out then one half is dead and the other half
> will most likely go to safemode because the recovering of the missing
> replicas will fill up the disks fast.
> 
> What things should we consider also?
> Has anyone any experience with such a setup?
> Is it a good idea to do this?
> 
> What are better options for us to consider?
> 
> Thanks for any input.
> 
> 
> 
> 
> -- 
> Bertrand Dechoux


Re: Running a single cluster in multiple datacenters

Posted by Bertrand Dechoux <de...@gmail.com>.
According to your own analysis, you wouldn't be more available, even though
that was your aim.
Did you consider having two separate clusters? One per datacenter, with an
automatic copy of the data?
I understand that load balancing of work and data would not be easy but it
seems to me a simple strategy (that I have seen working).

However, you are stating that the two datacenters are close and linked by a
big network connection.
What is the impact on the latency and the bandwidth? (between two nodes in
the same datacenter versus two nodes in different datacenters)
The main question is what happens when a job uses TaskTrackers from
datacenter A but DataNodes from datacenter B.
It will happen. Simply consider Reducer tasks that don't have any strategy
about locality because it doesn't really make sense in a general context.
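The cross-datacenter traffic this implies can be put in rough numbers (a back-of-envelope sketch, not a measurement): if a transfer's source and destination nodes are each picked uniformly across the two datacenters, the chance it crosses the inter-DC link is 2p(1-p), where p is the share of nodes in one datacenter.

```java
// Back-of-envelope estimate: if each transfer's endpoints are picked
// uniformly across both datacenters, the fraction of traffic crossing the
// inter-DC link is 2 * p * (1 - p), where p is the share of nodes in DC A.
public class ShuffleEstimate {
    public static double crossDcFraction(double shareInA) {
        return 2 * shareInA * (1 - shareInA);
    }

    public static void main(String[] args) {
        // An even 50/50 split sends half of all transfers over the link.
        System.out.println(crossDcFraction(0.5)); // prints 0.5
    }
}
```

So with an even split, roughly half the shuffle volume lands on the inter-DC link, which is why its bandwidth and latency matter so much.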

Regards

Bertrand


On Mon, Jul 15, 2013 at 11:56 PM, <jb...@nanthrax.net> wrote:

> Hi Niels,
>
> it depends on the number of replicas and the Hadoop rack configuration
> (level).
> It's possible to have replicas on the two datacenters.
>
> What rack configuration do you plan? You can implement your own
> one and define it using the topology.node.switch.mapping.impl property.
>
> Regards
> JB
>
>
> On 2013-07-15 23:49, Niels Basjes wrote:
>
>> Hi,
>>
>> Last week we had a discussion at work regarding setting up our new
>> Hadoop cluster(s).
>> One of the things that has changed is that the importance of the
>> Hadoop stack is growing so we want to be "more available".
>>
>> One of the points we talked about was setting up the cluster in such a
>> way that the nodes are physically located in two separate datacenters
>> (on opposite sides of the same city) with a big network connection in
>> between.
>>
>> We're currently talking about a cluster in the 50 node range, but that
>> will grow over time.
>>
>> The advantages I see:
>> - More CPU power available for jobs.
>> - The data is automatically copied between the datacenters as long as
>> we configure them to be different racks.
>>
>>
>> The disadvantages I see:
>> - If the network goes out then one half is dead and the other half
>> will most likely go to safemode because the recovering of the missing
>> replicas will fill up the disks fast.
>>
>> What things should we consider also?
>> Has anyone any experience with such a setup?
>> Is it a good idea to do this?
>>
>> What are better options for us to consider?
>>
>> Thanks for any input.
>>
>
>


-- 
Bertrand Dechoux

Re: Running a single cluster in multiple datacenters

Posted by jb...@nanthrax.net.
Hi Niels,

it depends on the number of replicas and the Hadoop rack configuration
(level).
It's possible to have replicas on the two datacenters.

What rack configuration do you plan? You can implement your own
one and define it using the topology.node.switch.mapping.impl
property.
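That property names a class implementing Hadoop's DNSToSwitchMapping interface, whose resolve() turns a list of node names into a list of /datacenter/rack paths. A minimal standalone sketch of that shape follows; the addressing scheme and class name are invented for illustration, and a real mapper would consult your own inventory.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a datacenter-aware topology mapper, mirroring the shape of
// Hadoop's DNSToSwitchMapping (the interface behind the
// topology.node.switch.mapping.impl property). The subnet scheme below is
// hypothetical: 10.1.x.x lives in datacenter A, 10.2.x.x in datacenter B,
// and the third octet picks the rack.
public class DatacenterRackMapper {

    /** Map one address to a /datacenter/rack path. */
    public static String resolveOne(String address) {
        String[] octets = address.split("\\.");
        String dc = octets[1].equals("1") ? "/dcA" : "/dcB";
        return dc + "/rack" + octets[2];
    }

    /** Same contract as DNSToSwitchMapping#resolve: one rack per input name. */
    public static List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>();
        for (String name : names) {
            racks.add(resolveOne(name));
        }
        return racks;
    }
}
```

Treating each datacenter as a top-level "rack" this way is what makes HDFS place replicas in both sites.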

Regards
JB

On 2013-07-15 23:49, Niels Basjes wrote:
> Hi,
>
> Last week we had a discussion at work regarding setting up our new
> Hadoop cluster(s).
> One of the things that has changed is that the importance of the
> Hadoop stack is growing so we want to be "more available".
>
> One of the points we talked about was setting up the cluster in such 
> a
> way that the nodes are physically located in two separate datacenters
> (on opposite sides of the same city) with a big network connection in
> between.
>
> We're currently talking about a cluster in the 50 node range, but that
> will grow over time.
>
> The advantages I see:
> - More CPU power available for jobs.
> - The data is automatically copied between the datacenters as long as
> we configure them to be different racks.
>
> The disadvantages I see:
> - If the network goes out then one half is dead and the other half
> will most likely go to safemode because the recovering of the missing
> replicas will fill up the disks fast.
>
> What things should we consider also?
> Has anyone any experience with such a setup?
> Is it a good idea to do this?
>
> What are better options for us to consider?
>
> Thanks for any input.

