Posted to common-user@hadoop.apache.org by Oleg Ruchovets <or...@gmail.com> on 2011/02/08 16:45:41 UTC

hadoop infrastructure questions (production environment)

Hi, we are going to production and have some questions:

   We are using the 0.20-append version (as I understand it is an HBase 0.90
requirement).


   1) Currently we have to process 50GB of text files per day, and it can grow to
150GB.
          -- What is the best Hadoop file size for our load, and is there a
suggested HDFS block size for that volume?
          -- We worked with gz files and saw that one map task was assigned
per file.
                  What is the best practice: work with gz files and save
disk space, or work without compression?
                  Let's say we want performance benefits and disk
space is less critical.

   2) Currently, when adding an additional machine to the grid, we need to manually
maintain all files and configurations.
         Is it possible to auto-deploy Hadoop servers without the need to
manually define each one on all nodes?


   3) Can we change masters without reinstalling the entire grid?



Thanks in advance,
 Oleg

Re: hadoop infrastructure questions (production environment)

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Feb 8, 2011, at 7:45 AM, Oleg Ruchovets wrote:

> Hi, we are going to production and have some questions:
> 
>   We are using the 0.20-append version (as I understand it is an HBase 0.90
> requirement).
> 
> 
>   1) Currently we have to process 50GB of text files per day, and it can grow to
> 150GB.
>          -- What is the best Hadoop file size for our load, and is there a
> suggested HDFS block size for that volume?

	What is the retention policy?  You'll likely want something bigger than the default 64MB, though, if the plan is "indefinitely" and you process these files at every job run.
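
	A minimal sketch of where that knob lives (not from the thread; the property key is the 0.20-era one, and the paths and sizes are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The cluster-wide default normally lives in hdfs-site.xml; this raises
            // it to 128MB for files written through this client (0.20-era key).
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);

            // The block size can also be picked per file at create() time:
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(
                    new Path("/data/input/part-0000"),        // hypothetical path
                    true, 4096, (short) 3, 256L * 1024 * 1024);
            out.writeBytes("example record\n");
            out.close();
        }
    }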

>          -- We worked with gz files and saw that one map task was assigned
> per file.

	Correct.  gzip files are not splittable.

>                  What is the best practice: work with gz files and save
> disk space, or work without compression?

	I'd recommend converting them to something else (such as a SequenceFile) that can be block compressed.
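
	A minimal sketch of that conversion with the 0.20-era SequenceFile API (not from the thread; the file names are hypothetical):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class GzToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Block compression packs many records per compressed block, so the
            // resulting file stays splittable, unlike a whole-file .gz.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/data/converted/events.seq"),   // hypothetical output
                    LongWritable.class, Text.class, CompressionType.BLOCK);

            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("events.log.gz")))); // hypothetical input

            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                writer.append(new LongWritable(lineNo++), new Text(line));
            }
            in.close();
            writer.close();
        }
    }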

>                  Let's say we want performance benefits and disk
> space is less critical.

	Then uncompress them. :D


>   2) Currently, when adding an additional machine to the grid, we need to manually
> maintain all files and configurations.
>         Is it possible to auto-deploy Hadoop servers without the need to
> manually define each one on all nodes?

	We basically use bcfg2 to manage different sets of configuration files (NN, JT, DN/TT, and gateway).  When a node is brought up, bcfg2 detects based upon the disk configuration what kind of machine it is and applies the appropriate configurations.  

	The only files that really need to get changed when you add/subtract nodes are the ones on the NN and JT.  The other nodes are oblivious to the change.
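
	(As a concrete illustration, not from the reply: in a stock Apache setup that usually means the slaves file read by the start/stop scripts and, if host-level access control is used, the include/exclude files pointed at by dfs.hosts and mapred.hosts; the exact set depends on how the cluster is managed.)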


>   3) Can we change masters without reinstalling the entire grid?

	Yes.  Depending upon how you manage the service movement (see a previous discussion on this last week), at most you'll need to bounce the DN and TT processes.


Re: hadoop infrastructure questions (production environment)

Posted by Konstantin Boudnik <co...@apache.org>.
On Wed, Feb 9, 2011 at 02:37, Steve Loughran <st...@apache.org> wrote:
> On 08/02/11 15:45, Oleg Ruchovets wrote:
...
>>    2) Currently, when adding an additional machine to the grid, we need to manually
>> maintain all files and configurations.
>>          Is it possible to auto-deploy Hadoop servers without the need to
>> manually define each one on all nodes?
>
> That's the only way people do it in production clusters: you use
> Configuration Management (CM) tools. Which one you use is your choice, but
> do use one.

You can go with something like Chef or Puppet: these seem to be quite
popular among Hadoop ops nowadays.

Cos

Re: hadoop infrastructure questions (production environment)

Posted by Steve Loughran <st...@apache.org>.
On 08/02/11 15:45, Oleg Ruchovets wrote:
> Hi, we are going to production and have some questions:
>
>     We are using the 0.20-append version (as I understand it is an HBase 0.90
> requirement).
>
>
>     1) Currently we have to process 50GB of text files per day, and it can grow to
> 150GB.
>            -- What is the best Hadoop file size for our load, and is there a
> suggested HDFS block size for that volume?

It depends on the number of machines and their performance. The smaller the 
blocks, the more map tasks the input can be split across, but the more load 
it puts on the NameNode and JobTracker.
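
To put rough numbers on that (mine, not Steve's): 150GB of daily input is about 
2,400 blocks at the default 64MB, i.e. roughly 2,400 map tasks per full pass over 
the data, versus roughly 1,200 at a 128MB block size, so doubling the block size 
halves the per-job scheduling work and the NameNode's block bookkeeping for the 
same data.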

>            -- We worked with gz files and saw that one map task was assigned
> per file.
>                    What is the best practice: work with gz files and save
> disk space, or work without compression?

Hadoop SequenceFiles can be compressed on a per-block basis. It's not quite as 
compact as gzipping the whole file, but it reduces your storage and network load 
while keeping the files splittable.
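
A sketch of one way to do that conversion in bulk, assuming the old (0.20) mapred 
API and a map-only pass over the raw text; the job name and paths are hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class GzToBlockCompressedSeq {
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(GzToBlockCompressedSeq.class);
            job.setJobName("gz-to-seqfile");                                  // hypothetical

            FileInputFormat.setInputPaths(job, new Path("/data/raw/today"));  // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("/data/seq/today")); // hypothetical

            // Map-only: the defaults (TextInputFormat, IdentityMapper) just pass
            // the records through; the point is the output side, block-compressed
            // SequenceFiles that later jobs can split.
            job.setNumReduceTasks(0);
            job.setOutputFormat(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

            JobClient.runJob(job);
        }
    }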


>                    Let's say we want performance benefits and disk
> space is less critical.
>
>     2) Currently, when adding an additional machine to the grid, we need to manually
> maintain all files and configurations.
>           Is it possible to auto-deploy Hadoop servers without the need to
> manually define each one on all nodes?

That's the only way people do it in production clusters: you use 
Configuration Management (CM) tools. Which one you use is your choice, 
but do use one.

>
>
>     3) Can we change masters without reinstalling the entire grid?

-if you can push out a new configuration and restart the workers, you 
can move the master nodes to any machine in the cluster after a failure.

-if you want to leave the NN and JT hostnames the same but change IP 
addresses, you need to restart all the workers, and make sure the DNS 
entries of the master nodes are set to expire rapidly so the OS doesn't 
cache them for long.

-if you have machines set up with the same hostname and IP addresses, 
then you can bring them up as the masters; just have the NameNode 
recover the edit log.
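
A small sketch of the config side of that (not from Steve's reply; the 0.20-era 
keys are fs.default.name and mapred.job.tracker, and the hostnames are made up). 
Keeping the workers pointed at stable hostnames rather than raw IPs is what makes 
the second and third options work without touching every node's config:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class MasterAddresses {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally these live in core-site.xml / mapred-site.xml on every node.
            // Hostnames (not IPs) mean a master can move to new hardware with only
            // a DNS change plus a worker restart.
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020");  // hypothetical host
            conf.set("mapred.job.tracker", "jobtracker.example.com:8021");    // hypothetical host

            FileSystem fs = FileSystem.get(conf);
            System.out.println("Default FS: " + fs.getUri());
        }
    }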