Posted to dev@hbase.apache.org by shourabh rawat <mi...@gmail.com> on 2009/02/17 18:01:47 UTC

Improving hbase read performance

I am trying to improve the read performance of HBase.
Can you suggest some ideas? I am using Java.

Re: Improving hbase read performance

Posted by shourabh rawat <mi...@gmail.com>.
Here's what I'm doing.

This is my get function. It should retrieve entities in parallel by
creating a thread for each get.

public String[] get(String tableName, String[] entityIDs) throws InterruptedException {
            ExecutorService threadExecutor = Executors.newFixedThreadPool(50);
            String[] contents = new String[entityIDs.length];
            for (int i = 0; i < entityIDs.length; i++) {
                threadExecutor.execute(new ReadThread(conf, tableName,
                        contents, entityIDs[i], i));
            }
            threadExecutor.shutdown();
            // Block until the tasks finish; a bare while(!isTerminated());
            // loop busy-spins a CPU core doing the same wait.
            threadExecutor.awaitTermination(10, TimeUnit.MINUTES);
            return contents;
    }


And here's the thread's run method:

    public void run() {
        long start = System.currentTimeMillis();
        try {
            Cell c = table.get(entityID, "content:");
            // get() returns null when the cell is absent; test the Cell itself.
            // Building a String from c.getValue() first throws a
            // NullPointerException on a missing row instead of storing "NULL".
            if (c == null) {
                j[index] = "NULL";
            } else {
                j[index] = new String(c.getValue());
            }
        } catch (IOException ex) {
            Logger.getLogger(ReadThread.class.getName()).log(Level.SEVERE, null, ex);
        }
        System.out.println((System.currentTimeMillis() - start)
                + " ms taken to complete for process " + index);
    }

I am creating a new HTable instance for each such thread.
Is this approach correct? Would I get better performance from it?
Will my get queries be executed in parallel by HBase?




Re: Improving hbase read performance

Posted by shourabh rawat <mi...@gmail.com>.
Does the number of regionservers affect this performance?


Re: Improving hbase read performance

Posted by stack <st...@duboce.net>.
HBase manages to which regionserver a query goes.  The client figures where
the row you are querying is hosted -- caching its knowledge of cluster
geography -- and sends the request to the hosting regionserver.

With a small cluster like yours, a threaded client where each thread does
lots of getting will give you better performance.  There is a relatively
large setup cost per task in MR so it'd probably run slower (MR would be
good for farming the requests out over the cluster and for ensuring they
complete).  For examples, see under src/example/mapred and study the
org.apache.hadoop.hbase.mapred package content.

No, hbase does not use MR as part of normal running.

St.Ack





Re: Improving hbase read performance

Posted by shourabh rawat <mi...@gmail.com>.
Hey, a few questions come to my mind:

Can I send individual requests to each regionserver? If yes, how does
HBase handle my requests? Does the HBase master distribute the requests
among regionservers, and do they process them in parallel?
Can I use MapReduce to improve my read performance, and how? (Each map
would be a get to HBase; would each map run on a different HBase server?)
Does HBase internally use MapReduce for handling get requests?




Re: Improving hbase read performance

Posted by stack <st...@duboce.net>.
On Wed, Feb 18, 2009 at 8:39 AM, shourabh rawat <mi...@gmail.com>wrote:

> Sorry to bug you again


It's no trouble. Let's figure it out.



> Well, I pasted my code a few posts back... Is it the same as what you are saying?
>

Pardon, I only just saw it.

Looks like you are setting up a thread pool of 50 threads and then each time
the thread runs, it gets one value only?  Each thread makes its own HTable
instance?

Set up a pool of 10 threads and have them each get 1000 values and see what
your numbers are like?  Or run ten processes each fetching 1000 values.

I say 10 because with 50, the single Connection is probably a bottleneck.  I
also say 1000 so the cost of thread setup is amortized.
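As a hedged sketch of that suggestion in current Java (the class name
BatchedGets and the fetch callback are my own invention; the callback stands
in for an HTable#get, and in a real client each worker thread would also
construct its own HTable instance):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class BatchedGets {
    // Fetch every key with a small fixed pool.  Each task handles a whole
    // slice of keys, not a single one, so per-task setup cost is amortized
    // over many gets.
    public static String[] getAll(String[] keys, int threads,
                                  Function<String, String> fetch)
            throws InterruptedException {
        String[] results = new String[keys.length];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int batch = (keys.length + threads - 1) / threads; // keys per task
        for (int start = 0; start < keys.length; start += batch) {
            final int from = start;
            final int to = Math.min(start + batch, keys.length);
            pool.execute(() -> {
                // One task walks one contiguous slice of the keys.
                for (int i = from; i < to; i++) {
                    results[i] = fetch.apply(keys[i]);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES); // block, don't busy-wait
        return results;
    }
}
```

The same shape covers 10 threads over 10,000 keys; only the pool size and
the key array change.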

0.20.0 hopefully will be out in a month or two.  There is still a bunch of
work to be done.


"You could also run multiple clients, each in its own process, so each
process gets its own Connection instance."

I didn't get what you mean by this...
> Well, is it possible to get multiple Connection instances? Isn't that
> a property of the HTables: with the same name they always have the
> same Connection instance?
> Could you give some sample code that could help me with these "multiple
> connection instances"?


I was suggesting that you invoke your client program ten times,
concurrently, e.g.: for i in $(seq 1 10); do java YOURPROGRAM & done
(something like that).  You'd need to let it run longer so the cost of JVM
setup would wash out.
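Spelled out as a runnable script (launch_clients and client.jar are
placeholder names of mine; the point is only that each process is a
separate JVM and so gets its own HBase Connection):

```shell
# Run n copies of a command concurrently, one process (one JVM, one
# HBase Connection) each, then block until all of them finish.
launch_clients() {
  local n="$1"; shift
  for i in $(seq 1 "$n"); do
    "$@" &          # one client in the background
  done
  wait              # block until every client exits
}

# e.g.: launch_clients 10 java -jar client.jar
```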

St.Ack

Re: Improving hbase read performance

Posted by shourabh rawat <mi...@gmail.com>.
Sorry to bug you again, but this problem is troubling me a lot...

"Yes.  Do multiple instances of HTable.  You won't do the ten requests in
the time it would take to do one.  It'll be more like the time to do 2 or 3
(at least in my primitive testing).  If you had more regionservers, it would
complete in a shorter time (it's the single Connection issue you mentioned
in an earlier mail)."

Well, I pasted my code a few posts back... Is it the same as what you are saying?
It doesn't seem to be improving my performance, though.

here's the log
The search results size : 50
we are here
22 time taken to complete for process 0
49 time taken to complete for process 1
41 time taken to complete for process 3
91 time taken to complete for process 2
120 time taken to complete for process 4
22 time taken to complete for process 7
35 time taken to complete for process 5
64 time taken to complete for process 8
73 time taken to complete for process 9
93 time taken to complete for process 6
93 time taken to complete for process 11
109 time taken to complete for process 12
143 time taken to complete for process 13
119 time taken to complete for process 14
289 time taken to complete for process 10
8 time taken to complete for process 19
9 time taken to complete for process 18
69 time taken to complete for process 17
32 time taken to complete for process 21
10 time taken to complete for process 24
13 time taken to complete for process 25
13 time taken to complete for process 26
59 time taken to complete for process 20
48 time taken to complete for process 22
57 time taken to complete for process 23
29 time taken to complete for process 29
224 time taken to complete for process 15
96 time taken to complete for process 28
95 time taken to complete for process 30
241 time taken to complete for process 16
66 time taken to complete for process 31
65 time taken to complete for process 32
101 time taken to complete for process 33
68 time taken to complete for process 35
75 time taken to complete for process 36
261 time taken to complete for process 27
57 time taken to complete for process 37
136 time taken to complete for process 34
54 time taken to complete for process 39
88 time taken to complete for process 40
42 time taken to complete for process 41
49 time taken to complete for process 43
81 time taken to complete for process 42
9 time taken to complete for process 45
14 time taken to complete for process 47
17 time taken to complete for process 46
18 time taken to complete for process 48
53 time taken to complete for process 49
265 time taken to complete for process 38
181 time taken to complete for process 44
time taken1960
Time taken in milli seconds to get content for 50 entities from the
HBase is : 1961
As you can see, the time is quite high, around 2 seconds.
I was expecting that with parallel threads the time would have been around
300 ms (265 ms is the max for any single get).
Could you figure out why this is happening? Is it that the gets are not
running in parallel?

"Depends on hardware, data, etc (See the wiki for the numbers I get with our
hardware and loading).

If this is important to you, you might wait on hbase 0.20.0.  Improving this
performance dimension is its focus.
"

Well, I am using a cluster of 3: 1 master and 3 regionservers.
I can't wait for 0.20.0; I need a solution now. Anyway, any idea when it
will be out?





"You could also run multiple clients, each in its own process, so each
process gets its own Connection instance."

I didn't get what you mean by this...
Well, is it possible to get multiple Connection instances? Isn't that a
property of the HTables: with the same name they always have the same
Connection instance?
Could you give some sample code that could help me with these "multiple
connection instances"?

Thanks again.

Re: Improving hbase read performance

Posted by stack <st...@duboce.net>.
On Wed, Feb 18, 2009 at 2:23 AM, shourabh rawat <mi...@gmail.com>wrote:

> hey,
>
> "> What do you mean by the above when you say read sequentially? Are you
> > scanning? (Getting a scanner and then nexting through your hbase
> table?)."
>
> Well, let's say I have 10 keys stored in HBase and I want to retrieve
> them.
>
> If I do the reads one by one, the total time is the sum of the 'get'
> times of each key.
> Could I do the same thing in parallel, so that all the gets occur
> concurrently and the total time is the max of the times taken by any of
> these keys rather than the sum of the individual times?


Yes.  Do multiple instances of HTable.  You won't do the ten requests in the
time it would take to do one.  It'll be more like the time to do 2 or 3 (at
least in my primitive testing).  If you had more regionservers, it would
complete in a shorter time (it's the single Connection issue you mentioned in
an earlier mail).


>
> > You will have to wait for hbase 0.20.0 or do as Erik suggests and put a
> > cache in front of hbase.  What are you trying to do with hbase?  Serve a
> > website? "
>
> Yeah, sort of, but I want to check performance without the use of a cache
> (random reads).  Can I get performance in the range of 10 ms with HBase?
>

Depends on hardware, data, etc (See the wiki for the numbers I get with our
hardware and loading).

If this is important to you, you might wait on hbase 0.20.0.  Improving this
performance dimension is its focus.


> So by a single connection you mean all the gets would be treated
> sequentially (one by one) by HBase even when the requests come in
> parallel (even when different HTable instances for the same table are
> employed)...


It does not do a request, wait for the response and then return the
response.  It interleaves the sending of requests and responses so you'll
see something like this:

request1
request2
response1
request3
request4
request5
response2
.....

This is how the Hadoop RPC works.  It's what we currently use.

You could also run multiple clients, each in its own process, so each
process gets its own Connection instance.

St.Ack

Re: Improving hbase read performance

Posted by shourabh rawat <mi...@gmail.com>.
hey,

"> What do you mean by the above when you say read sequentially? Are you
> scanning? (Getting a scanner and then nexting through your hbase table?)."

Well, let's say I have 10 keys stored in HBase and I want to retrieve
them.

If I do the reads one by one, the total time is the sum of the 'get'
times of each key.
Could I do the same thing in parallel, so that all the gets occur
concurrently and the total time is the max of the times taken by any of
these keys rather than the sum of the individual times?


"
> You will have to wait for hbase 0.20.0 or do as Erik suggests and put a
> cache in front of hbase.  What are you trying to do with hbase?  Serve a
> website? "

Yeah, sort of, but I want to check performance without the use of a cache
(random reads). Can I get performance in the range of 10 ms with HBase?

> Yeah, the RPC keeps a single connection per remote server but channel is
> shared by request and receive.  Testing in past, the more remote servers,
> the better, but even if a few only, concurrent HTables got better throughput
> than one running requests in series (the single connection is not fully
> occupied by requests and responses).
>

So by a single connection you mean all the gets would be treated
sequentially (one by one) by HBase even when the requests come in
parallel (even when different HTable instances for the same table are
employed)? Is there any way I can make it parallel?
The HBase master has one port that it specifies, and the other is the port
for HDFS (Hadoop). What can be done to increase the number of
connections, as you said?


Thanks for your help.

Re: Improving hbase read performance

Posted by stack <st...@duboce.net>.
On Tue, Feb 17, 2009 at 11:29 AM, shourabh rawat <mi...@gmail.com>wrote:

> Thanks for replying.
> Well, the problem is this:
> I have a distributed setup of HBase over Hadoop (a cluster of 3).
> I have loaded around 4 million entries into HBase.
> Now I want to read from it (read a set of entries).
> Reading sequentially adds to the total time.


What do you mean by the above when you say read sequentially? Are you
scanning? (Getting a scanner and then nexting through your hbase table?).


>
> I want really good performance (I mean retrieval should be well within
> 10 ms per entry on average).


You will have to wait for hbase 0.20.0 or do as Erik suggests and put a
cache in front of hbase.  What are you trying to do with hbase?  Serve a
website?



>
> So I thought of trying out a bulk read (but there is no such function in
> the HBase API), so I resorted to threads: I created one HTable instance
> per thread and did gets on the same table in parallel.
> But the performance still doesn't seem to be affected.
> Are you sure that HBase treats them in parallel, or does it handle them
> sequentially even when there are parallel requests?
>
> Anyway, what is good performance for HBase? Any other way to improve
> this performance?
> Can multiple instances of HBase be created (and not HTable, as all the
> HTables seem to be using the same connection,
> I mean the HConnection object)?



Yeah, the RPC keeps a single connection per remote server, but the channel
is shared by requests and responses.  Testing in the past: the more remote
servers, the better; but even with only a few, concurrent HTables got better
throughput than one client running requests in series (the single connection
is not fully occupied by requests and responses).

St.Ack


>
>
> It would be great if you could help me with this and clear up my concepts.
>

Re: Improving hbase read performance

Posted by shourabh rawat <mi...@gmail.com>.
Thanks for replying.
Well, the problem is this:
I have a distributed setup of HBase over Hadoop (a cluster of 3).
I have loaded around 4 million entries into HBase.
Now I want to read from it (read a set of entries).
Reading sequentially adds to the total time.
I want really good performance (I mean retrieval should be well within
10 ms per entry on average).
So I thought of trying out a bulk read (but there is no such function in
the HBase API), so I resorted to threads: I created one HTable instance
per thread and did gets on the same table in parallel.
But the performance still doesn't seem to be affected.
Are you sure that HBase treats them in parallel, or does it handle them
sequentially even when there are parallel requests?

Anyway, what is good performance for HBase? Any other way to improve
this performance?
Can multiple instances of HBase be created (and not HTable, as all the
HTables seem to be using the same connection,
I mean the HConnection object)?

It would be great if you could help me with this and clear up my concepts.

Re: Improving hbase read performance

Posted by Erik Holstad <er...@gmail.com>.
Hey!
What can be done is to put your own cache in front of HBase that stores
past reads.

But we are currently working on this issue in HBASE-80, and also on the new
file format in HBASE-61.
Both of these will increase read performance once they are in place, which
is scheduled for 0.20. But if you need it now, you currently have to
implement it yourself.
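A minimal sketch of such a cache, assuming nothing HBase-specific (ReadCache
and its loader callback are invented names, not an HBase API; the loader
stands in for the HTable#get performed on a miss):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// A small read-through LRU cache in front of the store: repeated reads of
// the same row are served from memory; only misses call the loader.
public class ReadCache {
    private final Map<String, String> lru;
    private final Function<String, String> loader;

    public ReadCache(int capacity, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder=true turns LinkedHashMap into an LRU map; the eldest
        // entry is dropped whenever the map grows past capacity.
        this.lru = new LinkedHashMap<String, String>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > capacity;
            }
        };
    }

    public synchronized String get(String key) {
        String v = lru.get(key);   // a hit also refreshes LRU recency
        if (v == null) {
            v = loader.apply(key); // miss: go to the backing store
            lru.put(key, v);
        }
        return v;
    }
}
```

Repeated reads of hot rows are then answered from memory; only misses and
evicted keys touch the backing store.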

Regards Erik
