You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Raju Chinthala <ra...@imaginea.com> on 2014/07/10 12:06:59 UTC

Auto scaling on ec2

Scaling a Hadoop cluster with Hive has the following issues

1. Adding a computing node(Scaling up) when load on the cluster is high
decreases the execution time of the queries but its there is still a huge
time lag as the new node works on data from other nodes.

2. The process of removing a node from the cluster(Scaling down) when load
on the cluster is low, is also time consuming.

To reduce the time to scale the Hadoop cluster, we came up with the
following solution.

Prior to adding a new node, move the data from the existing nodes to the
new node. This balances the cluster and if a new task comes, the newly
added node can take it up as it already has the data (data locality).

When decommissioning a node, move the data available on that node to the
other nodes in the cluster., then decommission it.
We tested this with hive on hadoop on 5 node cluster,

*Time taken for Hive query,*

*4node cluster*

*Existing procedure(added new node) 5node cluster*

*New procedure(added new node) 5node cluster*

16mins,25sec

13mins,38sec

9mins,41sec


check the results and the approach here
<https://github.com/rajuch/Auto-scaling-on-ec2>

Any drawbacks/suggestions on this approach, we would like to hear from you..

-- 
Thanks & Regards,
Raju Chinthala