
Posted to user@hbase.apache.org by Hari Sreekumar <hs...@clickable.com> on 2010/11/09 16:29:21 UTC

Why and When to use HTablePool?

When is it preferable to use HTablePool over HTable and vice-versa? If I am
working on just one table, will using HTablePool potentially give me any
performance improvements?

hari

Re: Why and When to use HTablePool?

Posted by Ryan Rawson <ry...@gmail.com>.

Hi,

HTable is a small shim on top of other infrastructure inside the
client.  The client opens only 1 socket per regionserver per JVM, and
all HTable instances share it.  HTable instances also create a small
ThreadPoolExecutor to help with parallelism.  Thus the overhead of
creating more HTables is not very high, but it is not entirely 0,
which is why HTablePool exists.

The general guideline is that each thread requires its own HTable.
Since it's just a shim on top of HConnectionManager and friends, this
requirement shouldn't be too onerous.
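
To illustrate the "one instance per thread" guideline, here is a minimal sketch in plain Java. It does not use the HBase client at all: FakeTable is a hypothetical stand-in for HTable (cheap to create, not safe to share), and PerThreadTables and its run() method are names made up for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for HTable: cheap to create, but not safe to share
// between threads. It is NOT part of the HBase API.
class FakeTable {
    static final AtomicInteger created = new AtomicInteger();
    private int puts; // unsynchronized on purpose: only the owning thread touches it

    FakeTable() { created.incrementAndGet(); }

    void put(String row) { puts++; }

    int putCount() { return puts; }
}

class PerThreadTables {
    // Each thread lazily gets its own FakeTable the first time it calls get().
    private static final ThreadLocal<FakeTable> TABLE =
            ThreadLocal.withInitial(FakeTable::new);

    // Runs `tasks` writes on `threads` worker threads and returns how many
    // FakeTable instances were created in total.
    static int run(int threads, int tasks) {
        FakeTable.created.set(0);
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            final String row = "row-" + i;
            workers.execute(() -> TABLE.get().put(row));
        }
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return FakeTable.created.get();
    }
}
```

Calling PerThreadTables.run(4, 100) performs 100 writes but creates at most 4 FakeTable instances, one per worker thread, so no instance is ever touched by two threads.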

-ryan


RE: Why and When to use HTablePool?

Posted by Michael Segel <mi...@hotmail.com>.

> From: tsunanet@gmail.com
> Date: Wed, 10 Nov 2010 11:57:56 -0800
> Subject: Re: Why and When to use HTablePool?
> To: user@hbase.apache.org
> 
> On Tue, Nov 9, 2010 at 8:14 AM, Michael Segel <mi...@hotmail.com> wrote:
> > The use case for the HTablePool is pretty much the same as any application where you need to fetch a resource from a pool rather than constantly instantiate them.
> >
> > Really the driving factor on which to use (HTable or HTablePool) is going to be your use case, or rather what it is you hope to achieve.
> 
> HTable isn't thread-safe when you write to HBase.  I think that's why
> HTablePool exists.
> 
> If you want a highly scalable HBase client in which you don't need to
> think about pools or thread safety, you can take a look at asynchbase:
> http://github.com/stumbleupon/asynchbase
> 
> -- 
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com

Ok... 
You really don't want to do multi-threading within a m/r job.

I mean, yes you can, but you don't really want to do it unless you have a really good reason.

But to your point, HBase can be used outside of a m/r job, which is what I was talking about in the paragraph you cut from my post.
There, where you have a multi-threaded client, you'd want to use the HTablePool.

And again, the driving factor will be your use case... even in a multi-threaded client, you could still use HTable, as long as you make sure that each thread instantiates its own instance.
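
For readers who have not used a resource pool before, the borrow/return pattern being discussed can be sketched in plain Java as follows. This is not HTablePool itself: TableHandle and SimpleTablePool are hypothetical names invented for this example, and the real class may behave differently in details such as sizing and cleanup.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a table handle; NOT the HBase API.
class TableHandle {
    final String tableName;

    TableHandle(String tableName) { this.tableName = tableName; }
}

// Minimal bounded pool: reuse an idle handle when one exists,
// create a new one only when none is idle.
class SimpleTablePool {
    private final String tableName;
    private final int maxIdle;
    private final Deque<TableHandle> idle = new ArrayDeque<>();
    private int created;

    SimpleTablePool(String tableName, int maxIdle) {
        this.tableName = tableName;
        this.maxIdle = maxIdle;
    }

    // Hand out an idle handle if there is one, otherwise build a fresh one.
    synchronized TableHandle borrow() {
        TableHandle handle = idle.poll();
        if (handle == null) {
            handle = new TableHandle(tableName);
            created++;
        }
        return handle;
    }

    // Give the handle back; if the pool already holds maxIdle handles,
    // the extra one is simply dropped.
    synchronized void giveBack(TableHandle handle) {
        if (idle.size() < maxIdle) {
            idle.push(handle);
        }
    }

    synchronized int createdCount() { return created; }
}
```

A worker thread would call borrow(), do its reads or writes, and call giveBack() in a finally block, so a handle is only ever used by one thread at a time and is reused instead of being rebuilt for every request.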

Re: Why and When to use HTablePool?

Posted by tsuna <ts...@gmail.com>.

On Tue, Nov 9, 2010 at 8:14 AM, Michael Segel <mi...@hotmail.com> wrote:
> The use case for the HTablePool is pretty much the same as any application where you need to fetch a resource from a pool rather than constantly instantiate them.
>
> Really the driving factor on which to use (HTable or HTablePool) is going to be your use case, or rather what it is you hope to achieve.

HTable isn't thread-safe when you write to HBase.  I think that's why
HTablePool exists.

If you want a highly scalable HBase client in which you don't need to
think about pools or thread safety, you can take a look at asynchbase:
http://github.com/stumbleupon/asynchbase

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

RE: Why and When to use HTablePool?

Posted by Michael Segel <mi...@hotmail.com>.



> Date: Tue, 9 Nov 2010 23:13:53 +0530
> Subject: Re: Why and When to use HTablePool?
> From: hsreekumar@clickable.com
> To: user@hbase.apache.org
> 
> So the difference between a pool and an HTable would be negligible in a
> typical map-reduce environment, right.. if I am not creating any new HTable
> instances in the map and reduce phases? Perhaps creating a pool can have
> negative impact in this case?
> 

Well, I'm no expert but I can't see any reason to run a pool of HTable connections from within a m/r job.
(I'm sure someone can figure out a use case...) 

But from our experience... you create a single hbase table instance in setup() and you're set. A pool adds complexity and it doesn't improve performance from what I can see....
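
The "create it once in setup(), reuse it in map()" pattern looks roughly like the sketch below. It only mimics the task lifecycle in plain Java: UploadTask is not the Hadoop Mapper class, CountingTable is a hypothetical stand-in for an HTable instance, and the run() driver is made up to play the role of the framework.

```java
import java.util.List;

// Hypothetical stand-in for a table instance that counts how often it is opened.
class CountingTable {
    static int opened = 0;
    int puts = 0;

    CountingTable() { opened++; }

    void put(String line) { puts++; }

    void close() { }
}

// Mimics the setup()/map()/cleanup() lifecycle of a map task.
class UploadTask {
    private CountingTable table;

    // Called once per task: the only place a table is created.
    void setup() { table = new CountingTable(); }

    // Called once per input line: reuses the table created in setup().
    void map(String line) { table.put(line); }

    // Called once per task, after the last record.
    void cleanup() { table.close(); }

    // Drives the lifecycle the way a framework would and returns the
    // number of puts issued through the single table instance.
    static int run(List<String> lines) {
        UploadTask task = new UploadTask();
        task.setup();
        try {
            for (String line : lines) {
                task.map(line);
            }
        } finally {
            task.cleanup();
        }
        return task.table.puts;
    }
}
```

However many lines the task processes, the table is created exactly once, which is why a pool has nothing left to save in this setting.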

> e.g, what performance impact can I expect in my bulk uploading mapreducejob?
> I create an HTable connection in the run() method, each map converts a line
> from a text file to a put instance. Also, it would be great if any of you
> could point me to an example usage of TablePool.
> 
> hari

This is a difficult question to answer. There are too many factors: your hardware, network, design and quality of code will all have an impact on performance.
Of these, your design and your code will have the greatest impact.

Re: Why and When to use HTablePool?

Posted by Hari Sreekumar <hs...@clickable.com>.

So the difference between a pool and an HTable would be negligible in a
typical map-reduce environment, right.. if I am not creating any new HTable
instances in the map and reduce phases? Perhaps creating a pool can have
negative impact in this case?

E.g., what performance impact can I expect in my bulk-uploading MapReduce job?
I create an HTable connection in the run() method, and each map converts a line
from a text file to a Put instance. Also, it would be great if any of you
could point me to an example usage of HTablePool.

hari


RE: Why and When to use HTablePool?

Posted by Michael Segel <mi...@hotmail.com>.



> Date: Tue, 9 Nov 2010 09:57:42 -0600
> Subject: Re: Why and When to use HTablePool?
> From: barneyfranks1@gmail.com
> To: user@hbase.apache.org
> 
> Two differences that I know of:)
> 
> With htable you bear the overhead of instantiating the htable for each time
> you need access to it.  The overhead can be substantial if response time is
> your biggest concern.
> Example:  contact = new HTable(config, "contact");
> 
Huh?

Sorry, but that's a bit of an overly broad statement.

When you're using hbase in a map/reduce environment, you set up a single htable instance in setup() 
then reference it in your map() method. So you incur the cost of setting up the htable once.


If you're working in a single node, and a multi-threaded application like a web service reading from HBase, then you may want to have a pool 
of connections. Totally different design. 

The use case for the HTablePool is pretty much the same as any application where you need to fetch a resource from a pool rather than constantly instantiate them. 

Really the driving factor on which to use (HTable or HTablePool) is going to be your use case, or rather what it is you hope to achieve.

Re: Why and When to use HTablePool?

Posted by Barney Frank <ba...@gmail.com>.

Two differences that I know of:)

With HTable you bear the overhead of instantiating the HTable each time
you need access to it.  The overhead can be substantial if response time is
your biggest concern.
Example:  contact = new HTable(config, "contact");

Pooled means that the objects are pooled so you wouldn't bear the overhead
of object creation on each request.  The problem with the HTablePool is that
it does not "ride over restart", meaning that if you need to restart your
cluster, HTablePool will still be pointing at the old ports and not realize
the cluster is back up.  Hence there is
https://issues.apache.org/jira/browse/HBASE-2183.  Apparently it is slated
to be fixed in 0.92.

I'm not sure of the expected timing of 0.92, but if it is not too far off,
I'd go with HTablePool.

My two cents.
