Posted to general@hadoop.apache.org by Anthony Ikeda <An...@cardlink.com.au> on 2010/05/25 01:46:24 UTC

Active-Active Performance

I'm new to Hadoop and I've been given the task to see how we might
utilise Hadoop and HBase to implement an Active-Active site layer for
sharing information across a distributed application.

I've been able to:

* Install and get Hadoop running on a single node and am in the
process of configuring a 2-node setup.

* Install HBase on a single node and create a table and mapping
as well as insert data into the system.

Once I've got the multi-node setup configured I hope to run some tests as
well.

I've noticed that when starting Hadoop in distributed mode
(bin/start-all.sh), the slave will ssh to the master to start it as well,
provided the same path is set up on the remote machine.

Questions:

Can I configure the system if the Hadoop installation is not in the same
location on each machine?

If the master node goes down (say due to electrical fault or system
fault) how do the slave nodes react? Will they continue to run? Will the
nodes be back in sync once the master starts again?

Would I require a particular configuration to ensure that both our sites
can operate within the cluster as well as in a detached fashion (due to
maintenance or network issues)?

We want to ensure that data is added to HBase on each site with the data
synced across both sites. If one site goes down then recovery of data is
imperative.

Anthony Ikeda

Java Analyst/Programmer

Cardlink Services Limited

Level 4, 3 Rider Boulevard

Rhodes NSW 2138

Web: www.cardlink.com.au | Tel: + 61 2 9646 9221 | Fax: + 61 2 9646 9283

**********************************************************************
This e-mail message and any attachments are intended only for the use of the addressee(s) named above and may contain information that is privileged and confidential. If you are not the intended recipient, any display, dissemination, distribution, or copying is strictly prohibited. If you believe you have received this e-mail message in error, please immediately notify the sender by replying to this e-mail message or by telephone to (02) 9646 9222. Please delete the email and any attachments and do not retain the email or any attachments in any form.
**********************************************************************

Re: Active-Active Performance

Posted by Steve Loughran <st...@apache.org>.

Anthony Ikeda wrote:
> Thanks Hemanth,
> 
> In regards to different locations of the HADOOP home, this is low
> priority, more for testing than production. I was trying to install Hadoop
> for testing over 2 machines, with only a Windows XP machine running
> Cygwin and a Mac running Darwin. Not a priority.

Things are much easier if
  -all your machines have the same OS and disk structure
  -you are running on Linux
  -you use some CM (configuration management) tool to automate
   setup/deployment and push out config files

Start now, with VMware or VirtualBox images, so you learn about
management sooner rather than later.

> In regards to my last question about operating in a detached fashion, we
> are trying to factor in what happens when the link between both sites is
> cut. Will both sites operate independently until the connection is
> re-established? Is there any particular setup required to ensure we can
> cover this scenario or is it an out-of-the-box feature?

HDFS and the MapReduce engine are designed to run in a single datacentre 
with high-bandwidth, high-reliability links; current releases assume the 
facility is secure and all users are trusted. The key SPOF, the 
Namenode, doesn't do failover, so when it goes down or the network 
partitions, all machines that cannot see the NN poll and spin until it 
comes back, which can take a while unless you have a secondary namenode 
to keep the persistent files up to date. The workers all assume that 
the hostname and IP address of the namenode don't change, and never reread 
their config. You could use DNS to do failover, but you would have to tune 
the JVMs to not cache IP addresses for very long.
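
A minimal sketch of the DNS-failover idea, written in Python for brevity 
and not a description of how Hadoop workers actually behave: a client that 
looks the master's hostname up again before every attempt instead of 
holding on to the first address it saw. The names connect_with_reresolve 
and resolve_host, the injected connect callable and the parameters are all 
hypothetical. In a JVM, the setting that governs how long successful 
lookups are cached is the networkaddress.cache.ttl security property.

```python
import socket
import time


def resolve_host(hostname, port):
    """Look the name up afresh; return the first (ip, port) pair found."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return infos[0][4][:2]


def connect_with_reresolve(hostname, port, connect, resolve=resolve_host,
                           attempts=5, delay=1.0, sleep=time.sleep):
    """Try to reach the master, resolving its name again before every attempt.

    `connect` is any callable (hypothetical here) that takes an (ip, port)
    tuple and raises OSError on failure. Because no address is cached
    between attempts, a DNS record repointed at a standby machine is picked
    up on the next try.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            address = resolve(hostname, port)
            return connect(address)
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay)
    raise ConnectionError(f"{hostname}:{port} unreachable") from last_error
```

The resolver and the sleep function are parameters only so the loop can be 
exercised without a real network; in practice the defaults would be used.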

To do cross-site work you'd need a separate HDFS filesystem per site; 
synchronisation of data then becomes a task for the higher-level apps. I 
don't know what HBase, Cassandra or other column DB tools do here.
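
To make "a task for the higher-level apps" concrete, here is a toy sketch 
in Python. It assumes each site keeps its own store and records a 
timestamp with every write, and that the two stores are reconciled with a 
last-write-wins rule once the link returns. SiteStore, put, get and 
sync_sites are hypothetical names, not HBase or HDFS APIs, and 
last-write-wins is only one possible policy: it silently drops the older 
of two conflicting writes and depends on reasonably synchronised clocks.

```python
class SiteStore:
    """In-memory stand-in for one site's table: key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.cells = {}

    def put(self, key, value, timestamp):
        # Keep the write only if it is newer than what is already stored.
        # Ties keep the existing value, so a real scheme needs a tie-breaker.
        current = self.cells.get(key)
        if current is None or timestamp > current[0]:
            self.cells[key] = (timestamp, value)

    def get(self, key):
        cell = self.cells.get(key)
        return None if cell is None else cell[1]


def sync_sites(a, b):
    """Copy cells in both directions; the newest timestamp wins."""
    for source, target in ((a, b), (b, a)):
        for key, (timestamp, value) in list(source.cells.items()):
            target.put(key, value, timestamp)


site_a = SiteStore("A")
site_b = SiteStore("B")

# While the link is down, each site accepts writes on its own.
site_a.put("card:1", "active", timestamp=100)
site_b.put("card:1", "blocked", timestamp=105)
site_b.put("card:2", "active", timestamp=101)

# Once the link is back, reconcile the two stores.
sync_sites(site_a, site_b)
```

After the sync both stores hold the same cells, with the later write to 
"card:1" replacing the earlier one at Site A.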

-steve

RE: Active-Active Performance

Posted by Anthony Ikeda <An...@cardlink.com.au>.

Sorry Hemanth, internal terminology :)

We will most likely have 2 data centres (Site A and Site B) and we need data to be available between the 2 sites, i.e., if Site A stores data, Site B must be able to read it pretty much straight away.

We are thinking of installing Hadoop and HBase on each site to ensure data stored at Site A is readily available for Site B. We are not yet sure how many installations are required per site, but I'm guessing one site will house the Master and the other the Slave.

Because this is an Active-Active configuration, we need to ensure that if the link between the sites goes down, they can still operate to a degree and, once the link is recovered, both sites will sync up once more.

Also from a maintenance point of view, we may wish to add more instances of Hadoop/HBase slaves on new machines in each site without disturbing the operation of the application.

Anthony

Re: Active-Active Performance

Posted by Hemanth Yamijala <yh...@gmail.com>.

Anthony,

> In regards to different locations of the HADOOP home, this is low
> priority, more for testing than production. I was trying to install Hadoop
> for testing over 2 machines, with only a Windows XP machine running
> Cygwin and a Mac running Darwin. Not a priority.
>
> In regards to my last question about operating in a detached fashion, we
> are trying to factor in what happens when the link between both sites is
> cut. Will both sites operate independently until the connection is
> re-established? Is there any particular setup required to ensure we can
> cover this scenario or is it an out-of-the-box feature?

When you say 'sites', do you mean two different Hadoop installations?
In general, each site is independent. So, I am unable to understand
where the link comes in.


RE: Active-Active Performance

Posted by Anthony Ikeda <An...@cardlink.com.au>.

Thanks Hemanth,

In regards to different locations of the HADOOP home, this is low
priority, more for testing than production. I was trying to install Hadoop
for testing over 2 machines, with only a Windows XP machine running
Cygwin and a Mac running Darwin. Not a priority.

In regards to my last question about operating in a detached fashion, we
are trying to factor in what happens when the link between both sites is
cut. Will both sites operate independently until the connection is
re-established? Is there any particular setup required to ensure we can
cover this scenario or is it an out-of-the-box feature?

Anthony


Re: Active-Active Performance

Posted by Hemanth Yamijala <yh...@gmail.com>.

Anthony,

> I’m new to Hadoop and I’ve been given the task to see how we might utilise
> Hadoop and HBase to implement an Active-Active site layer for sharing
> information across a distributed application.
>
> I’ve been able to:
>
> * Install and get Hadoop running on a single node and am in the
> process of configuring a 2-node setup.
>
> * Install HBase on a single node and create a table and mapping as
> well as insert data into the system.
>
> Once I’ve got the multi-node setup configured I hope to run some tests as well.
>
> I’ve noticed that trying to start Hadoop in distributed mode, the slave
> will ssh to the master to start it as well (bin/start-all.sh) provided the
> same path is set up on the remote machine.
>
> Questions:
>
> Can I configure the system if the Hadoop installation is not in the same
> location per machine?
>

I would think configuring and managing such a system would get very complex,
for example if you want to add nodes to expand in future. You would also
not be able to take advantage of the very helpful scripts that come with
Hadoop. Is there a reason why you want to do this?

> If the master node goes down (say due to electrical fault or system fault)
> how do the slave nodes react? Will they continue to run? Will the nodes be
> back in sync once the master starts again?
>

Hadoop slaves will continue. They will enter a retry loop, trying to connect
to the master until it comes up. In doing so, they could fill up log files
very fast, though. If the master starts with the same configuration (same
host, ports), they should be able to connect and resume.
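
The retry behaviour described above can be pictured with a short Python
sketch. This illustrates a generic retry loop, not Hadoop's actual code:
wait_for_master, backoff_delays, their parameters and the injected ping
callable are all hypothetical. The delay grows exponentially up to a cap,
which is one way to keep a long outage from flooding the logs.

```python
import logging
import time

log = logging.getLogger("worker")


def backoff_delays(initial=1.0, factor=2.0, cap=60.0):
    """Yield 1, 2, 4, ... seconds, never exceeding `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_for_master(ping, max_attempts=None, sleep=time.sleep, **backoff):
    """Call `ping()` until it succeeds; return the number of attempts made.

    `ping` should raise OSError while the master is unreachable. With
    max_attempts=None the loop retries forever.
    """
    attempt = 0
    for delay in backoff_delays(**backoff):
        attempt += 1
        try:
            ping()
            return attempt
        except OSError as err:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # One log line per attempt; the growing delay limits log volume.
            log.warning("master unreachable (%s), retrying in %.0fs", err, delay)
            sleep(delay)
```

With the defaults the loop waits 1, 2, 4, 8, 16 and 32 seconds and then
60 seconds between every later attempt, so it writes roughly one log line
a minute during a long outage.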

> Would I require a particular configuration to ensure that both our sites
> can operate within the cluster as well as in a detached fashion (due to
> maintenance or network issues)?
>
>
>
I did not quite follow this. Can you explain a little more about how you
want to set up your system?

Thanks
Hemanth