Posted to user@hbase.apache.org by Dalia Sobhy <da...@hotmail.com> on 2012/12/22 16:43:40 UTC

Hbase scalability performance

Dear all,

I am testing a simple HBase application on a cluster of multiple nodes.

I am specifically testing scalability by measuring the time taken for random reads.

Data size: 200,000 rows
Row key: 0, 1, 2, ... (a very simple incremental row key)

But I don't know why, as I increase the cluster size, I see the same time.

For example:
2 datanodes: 1,000 random reads: 1.757 sec
3 datanodes: 1,000 random reads: 1.7 sec

Any help please?


RE: Hbase scalability performance

Posted by Dalia Sobhy <da...@hotmail.com>.
Dear all,

Thanks for your help.

I am already using coprocessors for this table.

I already tried a similar program using the Thrift server, on a 23-node cluster on the Rackspace cloud, but likewise I didn't see any performance improvement. I was then advised to use physical machines (not virtual ones) and more than 100 Mbps of bandwidth, as those two issues were said to be the cause of the poor performance. But on trying that, I found the same result.

> From: dontariq@gmail.com
> Date: Sat, 22 Dec 2012 23:09:54 +0530
> Subject: Re: Hbase scalability performance
> To: user@hbase.apache.org
> 
> I totally agree with Michael. I was about to point out the same thing.
> Probability of RS hotspotting is high when we have sequential keys. Even if
> everything is balanced and your cluster is very well configured you might
> end up with this issue.
> 
> Best Regards,
> Tariq
> +91-9741563634
> https://mtariq.jux.com/
> 

Re: Hbase scalability performance

Posted by Mohammad Tariq <do...@gmail.com>.
I totally agree with Michael. I was about to point out the same thing.
Probability of RS hotspotting is high when we have sequential keys. Even if
everything is balanced and your cluster is very well configured, you might
end up with this issue.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/
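The hotspotting Tariq describes comes from sequential keys all sorting into the same region. One common remedy, a simpler cousin of the hashed UUID Michael suggests, is to prefix each key with a short hash of itself so consecutive rows scatter across the key space. A minimal sketch in Python (illustrative only; the helper name and the 4-character prefix length are made up, and the table in this thread actually uses plain numeric keys):

```python
import hashlib

def salted_key(seq_id, prefix_len=4):
    """Prefix a sequential id with a short hash so consecutive ids no
    longer sort next to each other, spreading rows across regions.
    (Hypothetical helper; the thread's table uses plain 0, 1, 2, ...)"""
    digest = hashlib.md5(str(seq_id).encode()).hexdigest()
    return "%s-%s" % (digest[:prefix_len], seq_id)

# Consecutive ids map to keys scattered over the key space:
keys = [salted_key(i) for i in range(5)]
```

The trade-off is that range scans over the original ordering are lost; hashed keys help spread random reads and writes, not sequential scans.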


On Sat, Dec 22, 2012 at 10:24 PM, Mohit Anchlia <mo...@gmail.com>wrote:

> Also, check how balanced your region servers are accross all the nodes
>

Re: Hbase scalability performance

Posted by Mohit Anchlia <mo...@gmail.com>.
Also, check how balanced your region servers are accross all the nodes

On Sat, Dec 22, 2012 at 8:50 AM, Varun Sharma <va...@pinterest.com> wrote:

> Note that adding nodes will improve throughput and not latency. So, if your
> client application for benchmarking is single threaded, do not expect an
> improvement in number of reads per second by just adding nodes.

Re: Hbase scalability performance

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Dalia,

      You can go to the HBase web UI to see the details, as Ted specified
earlier. But if you really want to monitor everything properly, I would
suggest configuring Ganglia to capture the metrics. For a quick check
you can also use the "status" command from the HBase shell.
hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'

HTH

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/


On Sun, Dec 23, 2012 at 7:27 PM, Dimitry Goldin <go...@neofonie.de> wrote:

> Hi,
>
>
> You should take a look at YCSB (https://github.com/brianfrankcooper/YCSB);
> maybe one of the premade workloads fits your scenario.
>
> Cheers
>
>

Re: Hbase scalability performance

Posted by Dimitry Goldin <go...@neofonie.de>.
Hi,

On 23.12.2012 14:38, Dalia Sobhy wrote:
>
> So do you have an example of a multithreaded program? I am using the ready-made Java API, not the Thrift server, so I don't know how to write a multithreaded program using this API.

You should take a look at YCSB
(https://github.com/brianfrankcooper/YCSB); maybe one of the premade
workloads fits your scenario.

Cheers


RE: Hbase scalability performance

Posted by Dalia Sobhy <da...@hotmail.com>.
So do you have an example of a multithreaded program? I am using the ready-made Java API, not the Thrift server, so I don't know how to write a multithreaded program using this API.
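Varun's point, quoted below, is that extra nodes raise throughput only if the client issues reads concurrently. As a rough illustration of the shape of a multithreaded read loop, here is a Python sketch with a stubbed fetch function standing in for a real HBase get; in the Java API each thread would typically use its own HTable instance rather than sharing one, since HTable instances are generally not shared across threads. The function names and latency figure are invented for the sketch:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(row_key):
    """Stub for a real random read (e.g. an HBase get); the sleep
    stands in for network/server latency."""
    time.sleep(0.001)
    return "value-%s" % row_key

def timed_random_reads(num_reads, num_rows, num_threads):
    """Issue num_reads reads of random keys across num_threads workers
    and return (results, elapsed_seconds)."""
    keys = [str(random.randrange(num_rows)) for _ in range(num_reads)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(fetch, keys))
    return results, time.time() - start

# With a latency-bound stub, more in-flight reads mean more reads per
# second, even though each individual read is no faster.
_, serial = timed_random_reads(100, 200000, num_threads=1)
_, parallel = timed_random_reads(100, 200000, num_threads=10)
```

The same pattern is what a benchmark like YCSB implements for you, with configurable thread counts and workloads.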

> Date: Sat, 22 Dec 2012 08:50:56 -0800
> Subject: Re: Hbase scalability performance
> From: varun@pinterest.com
> To: user@hbase.apache.org
> 
> Note that adding nodes will improve throughput and not latency. So, if your
> client application for benchmarking is single threaded, do not expect an
> improvement in number of reads per second by just adding nodes.
> 

Re: Hbase scalability performance

Posted by Varun Sharma <va...@pinterest.com>.
Note that adding nodes will improve throughput and not latency. So, if your
client application for benchmarking is single threaded, do not expect an
improvement in number of reads per second by just adding nodes.


Re: Hbase scalability performance

Posted by Michael Segel <mi...@hotmail.com>.
I thought it was Doug Miel who said that HBase doesn't start to shine until you have at least 5 nodes. 
(Apologies if I misspelled Doug's name.) 

I happen to concur and if you want to start testing scalability, you will want to build a bigger test rig. 

Just saying!


Oh and you're going to have a hot spot on that row key. 
Maybe do a hashed UUID ? 

I would suggest that you consider the following:

Create N rows, where N is a very large number. 
Then, to generate your random access, do a full table scan to get the N row keys into memory. 
Using a random number generator, generate a random number and pop that row key off the stack, so that the next iteration draws from the remaining N-1 keys. 
Do this 200K times. 

Now time your 200K random fetches. 

It would be interesting to see how it performs, averaged over a 'couple' of runs... then increase the key space by an order of magnitude. 
(Start with 1 million rows, 10 million rows, 100 million rows....) 

In theory, if properly tuned, one should expect near-linear results. That is to say, the time it takes to get() a row across the data space should be consistent. Although I wonder if you would have to somehow clear the cache? 
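The procedure above might be sketched roughly like this, in Python, with an in-memory dict standing in for the table; a real run would scan the actual HBase table for its row keys and time real get() calls instead:

```python
import random
import time

def random_fetch_benchmark(table, num_fetches):
    """Load the key space into memory, then repeatedly pop a randomly
    chosen key and fetch its row, timing the fetches. Popping means the
    next draw is over the remaining N-1 keys (sampling without
    replacement), as described above."""
    keys = list(table)          # the "full table scan" for row keys
    random.shuffle(keys)        # random order, so pop() draws randomly
    start = time.time()
    done = 0
    while keys and done < num_fetches:
        key = keys.pop()        # remove so it cannot be drawn again
        _ = table[key]          # the timed random read
        done += 1
    return done, time.time() - start

# Toy stand-in table; the suggestion is to repeat the run as the key
# space grows from 1 million to 10 million to 100 million rows.
toy_table = {str(i): "value-%d" % i for i in range(1000)}
fetched, elapsed = random_fetch_benchmark(toy_table, 200)
```

Under the near-linearity hypothesis, elapsed time per fetch should stay roughly flat as the table grows, provided caching effects are controlled for.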


Sorry, just a random thought... 

-Mike

On Dec 22, 2012, at 10:06 AM, Ted Yu <yu...@gmail.com> wrote:

> By '3 datanodes', did you mean that you also increased the number of region
> servers to 3 ?
> 
> When your test was running, did you look at Web UI to see whether load was
> balanced ? You can also use Ganglia for such purpose.
> 
> What version of HBase are you using ?
> 
> Thanks
> 


RE: Hbase scalability performance

Posted by Dalia Sobhy <da...@hotmail.com>.
I am using 3 region servers.

HBase version: 0.92
Cloudera Manager: 4.1

How do I know whether the load is balanced, Ted?

> Date: Sat, 22 Dec 2012 08:06:59 -0800
> Subject: Re: Hbase scalability performance
> From: yuzhihong@gmail.com
> To: user@hbase.apache.org
> 
> By '3 datanodes', did you mean that you also increased the number of region
> servers to 3 ?
> 
> When your test was running, did you look at Web UI to see whether load was
> balanced ? You can also use Ganglia for such purpose.
> 
> What version of HBase are you using ?
> 
> Thanks
> 

Re: Hbase scalability performance

Posted by Ted Yu <yu...@gmail.com>.
By '3 datanodes', did you mean that you also increased the number of region
servers to 3 ?

When your test was running, did you look at the web UI to see whether load was
balanced? You can also use Ganglia for that purpose.

What version of HBase are you using ?

Thanks

On Sat, Dec 22, 2012 at 7:43 AM, Dalia Sobhy <da...@hotmail.com>wrote:

> Dear all,
>
> I am testing a simple hbase application on a cluster of multiple nodes.
>
> I am especially testing the scalability performance, by measuring the time
> taken for random reads
>
> Data size: 200,000 row
> Row key : 0,1,2 very simple row key incremental
>
> But i don't know why by increasing the cluster size, I see the same time.
>
> For ex:
> 2 Datanodes: 1000 random read: 1.757 sec
> 3 datanodes: 1000 random read: 1.7 sec
>
> So any help plzzz ??
>
>