Posted to user@hbase.apache.org by Li Li <fa...@gmail.com> on 2014/05/07 14:05:20 UTC

how to optimize my cluster setting?

I have a small six-node cluster. One node runs the master and the namenode,
another runs the secondary namenode. The other 4 nodes are datanodes and
region servers.
Each node has 16GB of memory and a 4-core CPU.

My application is very simple. I use HBase to store data for a web spider.
The tables are:
1. url_db
     row key is MD5(url), and there are other columns for the url. The average
length of a row is about 1K
2. out_link
     row key is MD5(url1)+MD5(url2), and there are anchor text and other
columns. The average length is also less than 1K
3. in_link
      row key is MD5(url2)+MD5(url1).
4. other tables with very few rows

When a url is fetched by the fetcher, a link extractor extracts
all the urls in that web page.
So for each fetched url, I need to insert the newly found urls into url_db,
url+childurl into out_link, and childurl+url into in_link.
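
To make the key layout concrete, here is a minimal sketch of how the three row keys could be built. It assumes the keys are the raw 16-byte MD5 digests (not hex strings) of the UTF-8 url, which the description above does not specify; the class and method names are hypothetical and only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds row keys for url_db, out_link and in_link.
class LinkRowKeys {
    // Raw 16-byte MD5 digest of the url (assumption: UTF-8 encoding).
    static byte[] md5(String url) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Concatenate two byte arrays into a new array.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // url_db: MD5(url)
    static byte[] urlDbKey(String url) {
        return md5(url);
    }

    // out_link: MD5(url) + MD5(childUrl)
    static byte[] outLinkKey(String url, String childUrl) {
        return concat(md5(url), md5(childUrl));
    }

    // in_link: MD5(childUrl) + MD5(url)
    static byte[] inLinkKey(String url, String childUrl) {
        return concat(md5(childUrl), md5(url));
    }
}
```

With this layout, all out_link rows of one page share the same 16-byte prefix, and all in_link rows pointing to one page share the same 16-byte prefix.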

As for reading, there are a few MapReduce jobs that select priority
urls from url_db. They use full table scans of url_db and out_link.
The MapReduce jobs run every hour and take tens of minutes to complete.

At the beginning it was fast, but once url_db grew to tens of
millions of urls it slowed down. I also found that two of the 4 nodes have
very high load while the other two have low load: according to top, the
load average on two nodes is above 50 and on the other two it is below
1.
I tried to split the regions and move them manually, but after some
time the load is unbalanced again.
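
For reference, the kind of manual split meant here can be sketched as evenly spaced split points over the key space. This is only an illustration under the assumption that the MD5-based row keys are uniformly distributed over their leading bytes; the class and method names are hypothetical, and it says nothing about how the balancer then places the resulting regions.

```java
// Hypothetical helper: evenly spaced split keys for uniformly distributed row keys.
class UniformSplits {
    // Returns (regions - 1) two-byte split keys that divide the key space
    // 0x0000..0xFFFF into `regions` ranges of (nearly) equal size.
    static byte[][] splitKeys(int regions) {
        if (regions < 2 || regions > 65536) {
            throw new IllegalArgumentException("regions must be in [2, 65536]");
        }
        byte[][] keys = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            // Boundary of the i-th range within the 16-bit prefix space.
            int boundary = (int) ((long) i * 65536 / regions);
            keys[i - 1] = new byte[] { (byte) (boundary >>> 8), (byte) boundary };
        }
        return keys;
    }
}
```

For example, splitKeys(4) gives the boundaries 0x4000, 0x8000 and 0xC000, so each of the four ranges covers a quarter of the possible key prefixes.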

I am using HBase 0.94.11 with Hadoop 1.0.0.
Is the balancer in HBase 0.96/0.98 better for my case, or should I adjust some settings?