Posted to user@hbase.apache.org by nick maillard <ni...@fifty-five.com> on 2012/10/27 14:05:27 UTC

Hadoop/Hbase 0.94.2 performance what to expect

Hi everyone

So I've set up a three-machine Ubuntu Hadoop/HBase/Hive cluster:
master: Ubuntu 64-bit, 8 cores at 3 GHz, 16 GB RAM, gigabit Ethernet connection
2 slaves: the same
I went through the various documentation, blogs, and articles on Hadoop and
HBase tuning: map/reduce tasks set to 7, increased heap, xcievers,
compression, speculative execution off, etc.
I've installed YCSB to start stress testing, as well as testing on my own set of data.

Looking around I saw a lot of benchmarking experiences and tools, but to put it
simply, I don't know what I should expect.

When I import a 5 GB file through ImportTsv, it takes about an hour. (My keys
are not incremental.)
When I stress test with one thread writing 10 million entries, it takes a little
over an hour.
When I run something like 'select * from tableA where valueC=1' through Hive on a
table of about 1.5 million elements, it takes 4 minutes to resolve. Arguably I
should query by rowkey to really get a good time, but this example is to test
map/reduce against a dataset.

So, all in all, what should I expect? Is my dataset simply too small, so that
these times only seem relatively long? The writes seem really slow, and resolving
through map/reduce seems slow as well. Of course, maybe the time would be the
same for a much larger set, which would make a lot more sense.

Just for info, I have checked with iostat and my disks are about 95% idle.
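For perspective, those timings work out to the following rough per-second rates (a back-of-the-envelope sketch using only the figures quoted in this thread; the exact one-hour durations are approximate):

```python
# Back-of-the-envelope rates for the timings quoted above.
# Assumes each run took exactly one hour (3600 s), which is approximate.

import_bytes = 5 * 1024**3          # 5 GB ImportTsv input file
import_secs = 3600
mb_per_sec = import_bytes / import_secs / 1024**2
print(f"ImportTsv: {mb_per_sec:.1f} MB/s")          # → ImportTsv: 1.4 MB/s

stress_rows = 10_000_000            # single-threaded stress test
stress_secs = 3600
rows_per_sec = stress_rows / stress_secs
print(f"Stress test: {rows_per_sec:.0f} rows/s")    # → Stress test: 2778 rows/s
```

At roughly 1.4 MB/s with disks 95% idle, the bottleneck is almost certainly not I/O; per-row round trips to the region servers are a more likely suspect.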

So if someone would be kind enough to share what kind of performance I could
expect with my cluster, I could see whether my setup is really not responding
as it should, whether I'm using it the wrong way, or whether these numbers are
coherent.

regards


Re: Hadoop/Hbase 0.94.2 performance what to expect

Posted by nick maillard <ni...@fifty-five.com>.
Hello Kevin

In hbase-env I have only upped the heap to 3 GB.
But I'll gladly share my full file.

My rowkey setup is:
A rowkey: event_id
A single family: events
around 20 string columns
so in the table:
confblog_events{
     event_id:{
            event:{
                  columnA: value
                  columnB: value...
            }
     }
}

The import file is a CSV like valueA,ROWKEY,valueB,valueC...
It contains around 14 million entries.
The rowkeys are not incremental from line to line; they are randomly dispersed.
I see the different servers being written to, but I will check more thoroughly
tomorrow.
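Whether writes spread evenly can also be eyeballed with a toy bucket count (a sketch only; the key format and the three-way split are assumptions standing in for the actual region boundaries):

```python
import hashlib

# Hash each (hypothetical) event id into one of 3 buckets, standing in
# for the 3 nodes, and count how evenly the writes would spread.
keys = [f"event-{i}" for i in range(30_000)]
buckets = [0, 0, 0]
for k in keys:
    h = int(hashlib.md5(k.encode()).hexdigest(), 16)
    buckets[h % 3] += 1

print(buckets)  # roughly even counts if the keys are well dispersed
```

If a count like this (run over the real rowkeys) came out lopsided, most writes would land on one region server regardless of cluster size.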

If you are kind enough to check and want further info, I have opened my cluster:
you can see the HBase region server status at:
http://91.121.69.14:60030/rs-status
The table would be confblog_events.
This should show all my parameters.

If you want to see the ImportTsv job, you can look at:
http://91.121.69.14:50030/jobtracker.jsp
In the retired jobs there is the ImportTsv job that I ran.

I'm also trying to get a feel for reads and writes; with buffered imports the
time goes down to 20 minutes, which is acceptable.
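The improvement from buffering is consistent with simple RPC arithmetic (a sketch; the 300-byte average row size is an assumption, while 2 MB is the HBase 0.94 default for hbase.client.write.buffer):

```python
# RPC counts for the import, unbuffered vs. buffered client writes.
# row_bytes = 300 is an assumed average payload per row.
rows = 14_000_000
row_bytes = 300
buffer_bytes = 2 * 1024**2                  # hbase.client.write.buffer default

unbuffered_rpcs = rows                      # autoflush on: one RPC per Put
rows_per_flush = buffer_bytes // row_bytes  # Puts batched per buffer flush
buffered_rpcs = -(-rows // rows_per_flush)  # ceiling division

print(unbuffered_rpcs, buffered_rpcs)       # → 14000000 2003
```

Three to four orders of magnitude fewer round trips is consistent with the drop from roughly an hour to 20 minutes.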

On the same jobtracker page you will see, in the retired jobs, the select from
Hive that is applied to the same table. The process takes around 4 minutes; of
course it is not applied on the rowkey. I'm trying to understand whether this is
a decent duration or whether I am off.

Thanks a lot for your time and help.
I'm eager to understand either the error of my ways or whether this is a normal setup.

Re: Hadoop/Hbase 0.94.2 performance what to expect

Posted by Kevin O'dell <ke...@cloudera.com>.
Nick,

  Can you please provide your hbase-env, hbase-site.xml, and a describe of
the table in question?  Also, what is your row key setup?  When you are
doing the write do you see different region servers being written to or
just one?  How many rows are in this 5GB of data?

On Sat, Oct 27, 2012 at 8:05 AM, nick maillard <
nicolas.maillard@fifty-five.com> wrote:



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Hadoop/Hbase 0.94.2 performance what to expect

Posted by nick maillard <ni...@fifty-five.com>.
Hi Lars

Thanks as well for your help.
My machines all have one disk.

I am trying to get a feel on both elements reads and writes.
Hence my tests.
If I use buffered imports the time goes down to 20 minutes, which is acceptable
compared to one hour. I am wondering if this is in the realm of what should be
expected from a correctly functioning three-machine cluster. I can stress test;
I'm just not sure how to analyse the results.

In terms of reads through Hive, I am not using the rowkey, so I expect it to take
more time; I am wondering whether 4 minutes for a 1,400,000-entry table is a
coherent time. If I program a multithreaded environment over a file with the same
entries I get better performance, but it would not scale as well as Hadoop or
HBase. So my dataset might not be enough for a relevant test.

If you have time and need more info, I have opened the cluster:
http://91.121.69.14:50030/jobtracker.jsp
You can look at the retired 'select...' task to see exactly how it went.
Or, for the HBase cluster:
http://91.121.69.14:60030/rs-status

Thanks a lot to you and Kevin for your time, and for any advice or reference
processing times I could use to check my cluster implementation and try to
better it.

Re: Hadoop/Hbase 0.94.2 performance what to expect

Posted by lars hofhansl <lh...@yahoo.com>.
Hi Nick,

Are you asking about read or write performance? ImportTsv writes to HBase; Hive is read-only. Is this Hive on top of HBase, or on raw HDFS files?

How many disk drives do your boxes have?


-- Lars



________________________________
 From: nick maillard <ni...@fifty-five.com>
To: user@hbase.apache.org 
Sent: Saturday, October 27, 2012 5:05 AM
Subject: Hadoop/Hbase 0.94.2 performance what to expect
 