Posted to user@hive.apache.org by Time Less <ti...@gmail.com> on 2012/10/27 04:22:58 UTC

Building a Large Properly-Configured HBase Cluster -- In a Day

Greetings. I have an announcement that is of great interest to the
HBase/Hive communities.

tl;dr :: You can build a large, realistically-configured HBase cluster in a
few hours (from nothing) using Chef or Ansible (Puppet in the works). The
tool also builds large high-availability MySQL clusters, and other DBMSs
are in the works. The project is open-source.

Project :: The Palomino Cluster Tool (
https://github.com/time-palominodb/PalominoClusterTool). License: Apache.

Why this is interesting :: The Chef Cookbook and Ansible Playbooks are
known to generate fully-distributed HBase (with Zookeeper, separate
NameNode, JobTracker, HMaster, RegionServer, etc.) with far fewer bugs or
limitations than any other Cookbook or Playbook known before. They also are
not specific to one company's environment, but should be suitable to YOUR
environment.

I've done extensive searches for configuration management scripts that
would do this hard work for me and came up empty-handed, so I've rolled my
own and am OSSing it so that no-one else has to feel my pain.

This code represents hundreds of hours of research, iteration, web
searches, experience, tuning. Many of the common gotchas that... got me...
are covered. Xcievers? Check. Data files in /tmp? Nope. Ulimits? Check.
Init scripts that work with Chef? Check. Documentation covers other typical
gotchas. NameNode formatted? hdfs://users/mapred exists and is owned
properly? hdfs://hbase exists and is owned properly? Too many to list. Look
at the code
yourself, and feel free to write to the project mailing list with any
gotchas/tunings you're aware of that aren't covered in the code.
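
To make two of those gotchas concrete, here is a minimal Python sketch of
the kind of sanity check I mean. It is not code from the Palomino Cluster
Tool; the function names and the MIN_XCIEVERS threshold are hypothetical,
chosen for illustration. It assumes Hadoop 1.x-era property names
(dfs.datanode.max.xcievers, dfs.name.dir, dfs.data.dir) in hdfs-site.xml,
and flags a low or missing xcievers setting and any data directory that
would end up under /tmp.

```python
import xml.etree.ElementTree as ET

# Assumption: 4096 is a commonly suggested floor for HBase; tune for your cluster.
MIN_XCIEVERS = 4096


def load_properties(xml_text):
    """Parse Hadoop-style <configuration><property>... XML into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_hdfs_site(xml_text):
    """Return a list of warning strings; an empty list means no issue found."""
    props = load_properties(xml_text)
    warnings = []

    # Gotcha: too few DataNode transfer threads ("xcievers") for HBase.
    xcievers = props.get("dfs.datanode.max.xcievers")
    if xcievers is None:
        warnings.append("dfs.datanode.max.xcievers is not set")
    elif int(xcievers) < MIN_XCIEVERS:
        warnings.append(f"dfs.datanode.max.xcievers={xcievers} is below {MIN_XCIEVERS}")

    # Gotcha: data files in /tmp, either explicitly or via the unset default
    # (which is derived from hadoop.tmp.dir).
    for key in ("dfs.name.dir", "dfs.data.dir"):
        value = props.get(key)
        if not value:
            warnings.append(f"{key} is not set, so it falls back to hadoop.tmp.dir")
            continue
        for path in value.split(","):
            path = path.strip()
            if path == "/tmp" or path.startswith("/tmp/"):
                warnings.append(f"{key} contains a path under /tmp: {path}")
    return warnings
```

Feed it the text of an hdfs-site.xml and print whatever comes back. The
same pattern extends to the other items above, but those need a running
cluster to check, so they are left out of this sketch.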

Interesting Entrance Points :: HDFS+Hive+HBase on CentOS via Chef (
https://github.com/time-palominodb/PalominoClusterTool/tree/master/ChefCookbooks/CentOS/cloudera).
Multiple distributed DBMS, including HDFS+HBase on Ubuntu via Ansible (
https://github.com/time-palominodb/PalominoClusterTool/tree/master/AnsiblePlaybooks/Ubuntu-12.04
).

Feedback welcome. Pull requests more than welcome.

-- 
*Tim Ellis | *Fifth Sigma, Inc.
Excellence in Multimedia and Technology