Posted to user@hbase.apache.org by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de> on 2011/03/09 10:33:06 UTC

HBase application is slow in 4-node cluster

Hi, everyone. I am using HBase to store and retrieve data. There are
currently four nodes. I found that when I run some applications using
HBase, they are not as fast as I expected; every run is much slower than
a centralized setup on a single node. I am not clear on the real cause:
is it I/O throughput or HBase response time? Could an HBase expert tell me?

Best Wishes!
Musa

Re: Many smaller tables vs one large table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Well, those regions are still distributed. Depending on the amount of
data you generate per day you may have only 1 region per day, but even
if you were using a single table and storing those rows next to each
other, the access pattern would stay the same, no?

J-D

On Wed, Mar 9, 2011 at 3:37 PM, Peter Haidinyak <ph...@local.com> wrote:
> I do that now, but if they were in different tables I could thread that out with one thread per table. I'm just worried I'd lose the advantage of HBase as a distributed system if a table ends up on one region server.
>
> -Pete
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
> Sent: Wednesday, March 09, 2011 3:14 PM
> To: user@hbase.apache.org
> Subject: Re: Many smaller tables vs one large table
>
> I guess it could be a good idea... do you need to be able to scan for
> data that's contained in more than one day?
>
> J-D
>
> On Wed, Mar 9, 2011 at 2:08 PM, Peter Haidinyak <ph...@local.com> wrote:
>> Hi all,
>>    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables would end up being rather small and would most likely end up on one region server.
>>        Long story short, is this a good idea?
>>
>> Thanks
>>
>> -Pete
>>
>

RE: Many smaller tables vs one large table

Posted by Peter Haidinyak <ph...@local.com>.
I do that now, but if they were in different tables I could thread that out with one thread per table. I'm just worried I'd lose the advantage of HBase as a distributed system if a table ends up on one region server.

-Pete

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Wednesday, March 09, 2011 3:14 PM
To: user@hbase.apache.org
Subject: Re: Many smaller tables vs one large table

I guess it could be a good idea... do you need to be able to scan for
data that's contained in more than one day?

J-D

On Wed, Mar 9, 2011 at 2:08 PM, Peter Haidinyak <ph...@local.com> wrote:
> Hi all,
>    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables would end up being rather small and would most likely end up on one region server.
>        Long story short, is this a good idea?
>
> Thanks
>
> -Pete
>

Re: Many smaller tables vs one large table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I guess it could be a good idea... do you need to be able to scan for
data that's contained in more than one day?

J-D

On Wed, Mar 9, 2011 at 2:08 PM, Peter Haidinyak <ph...@local.com> wrote:
> Hi all,
>    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables would end up being rather small and would most likely end up on one region server.
>        Long story short, is this a good idea?
>
> Thanks
>
> -Pete
>

Many smaller tables vs one large table

Posted by Peter Haidinyak <ph...@local.com>.
Hi all,
    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables would end up being rather small and would most likely end up on one region server.
	Long story short, is this a good idea?

Thanks

-Pete
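For reference, the start/stop-key handling that per-day tables would make unnecessary can be sketched in plain Java. Nothing below comes from the thread itself; the `yyyyMMdd` prefix and the helper names are illustrative assumptions. With date-prefixed row keys in a single table, a one-day scan takes the day's prefix as the inclusive start key and the next day's prefix as the exclusive stop key:

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for date-prefixed row keys (the key format is an
// assumption, not something stated in the thread).
public class DayKeyRange {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    // Inclusive start key for a scan over a single day.
    static byte[] startKey(LocalDate day) {
        return FMT.format(day).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop key: the next day's prefix, so the scan covers
    // every row key that begins with this day's date.
    static byte[] stopKey(LocalDate day) {
        return FMT.format(day.plusDays(1)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2011, 3, 9);
        System.out.println(new String(startKey(day), StandardCharsets.UTF_8));
        System.out.println(new String(stopKey(day), StandardCharsets.UTF_8));
    }
}
```

These byte arrays would be handed to a Scan as its start and stop rows. Note that either way, one day's worth of data is contiguous in key space, so it may sit on a single region server regardless of whether it lives in its own table.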

RE: Job is faster with no cluster than with a 4-node cluster

Posted by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de>.
"Do the map tasks go to HBase?"
No, they don't. Only the reduce tasks go to HBase.

" At what point in the reduce do you see the above?  IIRC only past
66% does the reduce start entering data into HBase.  If you are
beyond this point, something else is up.  Try thread dumping a child
task or adding debugging to try to narrow in on where the holdup is."

I see this until 66%. After that I see only some requests. My problem is that the job is slower with a 4-node cluster than with no cluster.

Regards
Musa

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, March 09, 2011 9:50 PM
To: user@hbase.apache.org
Cc: Cavus,M.,Fa. Post Direkt
Subject: Re: Job is faster with no cluster than with a 4-node cluster

On Wed, Mar 9, 2011 at 1:57 AM, Cavus,M.,Fa. Post Direkt
<M....@postdirekt.de> wrote:
> Hi,
> I don't have problems with the map tasks.

Do the map tasks go to HBase?

> Here is what I see on my web interface:
>
> Region Servers
> Address             Start Code     Load
> slave1.local:60030  1298907779615  requests=0, regions=16, usedHeap=240, maxHeap=8183
> slave2.local:60030  1298907780330  requests=0, regions=16, usedHeap=307, maxHeap=8183
> slave3.local:60030  1298907778882  requests=0, regions=15, usedHeap=246, maxHeap=8183
> slave4.local:60030  1298907780059  requests=0, regions=16, usedHeap=413, maxHeap=8183
> Total:  servers: 4   requests=0, regions=63
>

At what point in the reduce do you see the above?  IIRC only past 66%
does the reduce start entering data into HBase.  If you are beyond this
point, something else is up.  Try thread dumping a child task or adding
debugging to try to narrow in on where the holdup is.

St.Ack

Re: Job is faster with no cluster than with a 4-node cluster

Posted by Stack <st...@duboce.net>.
On Wed, Mar 9, 2011 at 1:57 AM, Cavus,M.,Fa. Post Direkt
<M....@postdirekt.de> wrote:
> Hi,
> I don't have problems with the map tasks.

Do the map tasks go to HBase?

> Here is what I see on my web interface:
>
> Region Servers
> Address             Start Code     Load
> slave1.local:60030  1298907779615  requests=0, regions=16, usedHeap=240, maxHeap=8183
> slave2.local:60030  1298907780330  requests=0, regions=16, usedHeap=307, maxHeap=8183
> slave3.local:60030  1298907778882  requests=0, regions=15, usedHeap=246, maxHeap=8183
> slave4.local:60030  1298907780059  requests=0, regions=16, usedHeap=413, maxHeap=8183
> Total:  servers: 4   requests=0, regions=63
>

At what point in the reduce do you see the above?  IIRC only past 66%
does the reduce start entering data into HBase.  If you are beyond this
point, something else is up.  Try thread dumping a child task or adding
debugging to try to narrow in on where the holdup is.

St.Ack
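Thread dumping a child task is usually done from the shell: `jstack <pid>`, or `kill -QUIT <pid>` to write the dump to the task's stdout log. The same information can also be captured in-process using only the JDK; the sketch below is an illustration, not code from the thread, and where such a dump hook would be wired into the reduce task is an assumption:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Programmatic equivalent of a jstack-style dump: thread names, states,
// and stack traces for every live thread in this JVM.
public class ThreadDump {
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A reduce task stuck on HBase writes would show threads blocked
        // inside client RPC calls in this output.
        System.out.print(dump());
    }
}
```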

Re: Job is faster with no cluster than with a 4-node cluster

Posted by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de>.
Hi,
I don't have problems with the map tasks; I have problems with the
reduce tasks. If I start my job the map tasks are very fast, but the
reduce tasks are very, very slow. The reduce tasks are importing my data.

Here is what I see on my web interface:

Region Servers
Address             Start Code     Load
slave1.local:60030  1298907779615  requests=0, regions=16, usedHeap=240, maxHeap=8183
slave2.local:60030  1298907780330  requests=0, regions=16, usedHeap=307, maxHeap=8183
slave3.local:60030  1298907778882  requests=0, regions=15, usedHeap=246, maxHeap=8183
slave4.local:60030  1298907780059  requests=0, regions=16, usedHeap=413, maxHeap=8183
Total:  servers: 4   requests=0, regions=63


On Mon, Feb 28, 2011 at 6:00 AM, Cavus,M.,Fa. Post Direkt
<M....@...> wrote:
> I have a simple job. It imports 2 GB of data into HBase in 4 minutes
> with Hadoop in standalone (non-cluster) mode.
>
> If I configure fully distributed mode, it takes 40 minutes to import
> the same 2 GB into my 4-node cluster.
>

So, running the mapreduce job in standalone mode takes 4 minutes, but
distributed it's 40 minutes?  That sounds a bit odd.  Can you tell what
is going on for those 40 minutes?  How many map tasks?  How many HBase
regions?  Is it actually doing anything during this time?

St.Ack


RE: HBase application is slow in 4-node cluster

Posted by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de>.
Hi Jean-Daniel,

" I don't know much about how your job works (is it multithreaded, is it
a mapreduce job, etc) and it would be nice if you could tell us more
about it, so I'm going to assume you are inserting it in a single thread."

My job is a MapReduce job. I have a 4-node cluster. I don't have any
problems with my map tasks. I only have problems with my reduce tasks.

" If you have a single thread inserting into a 1-machine HBase cluster,
then the data is stored once. If you have 4 machines, and you set the
replication to 3, which is the default, then 2GB becomes 6GB and it's
all inserted sequentially. I would expect a slowdown."

I didn't configure replication, but I expected it to import in 4
minutes, because I have a 4-node cluster.

" Now, 40 minutes VS 4 minutes is an order of magnitude slower and it
doesn't seem right. Have you looked into where it's slow? Can you
investigate more and give us some other data points?"

I've installed HBase 0.90.1. Please find attached my HBase configuration files.



Regards
Musa


4 clusters? You mean 4 machines?

I don't know much about how your job works (is it multithreaded, is it
a mapreduce job, etc) and it would be nice if you could tell us more
about it, so I'm going to assume you are inserting it in a single thread.

If you have a single thread inserting into a 1-machine HBase cluster,
then the data is stored once. If you have 4 machines, and you set the
replication to 3, which is the default, then 2GB becomes 6GB and it's
all inserted sequentially. I would expect a slowdown.

Now, 40 minutes VS 4 minutes is an order of magnitude slower and it
doesn't seem right. Have you looked into where it's slow? Can you
investigate more and give us some other data points?

J-D

On Mon, Feb 28, 2011 at 6:00 AM, Cavus,M.,Fa. Post Direkt
<M....@...> wrote:
> Hi,
>
>
>
> I have a simple job. It imports 2 GB of data into HBase in 4 minutes
> with Hadoop in standalone (non-cluster) mode.
>
> If I configure fully distributed mode, it takes 40 minutes to import
> the same 2 GB into my 4-node cluster.
>
> Did anyone have the same problem?
>
> Regards
>
> Musa
>
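J-D's replication point above can be made concrete with a little arithmetic: with HDFS replication at the default of 3, every byte the reduce tasks write lands on disk three times, so a roughly 3x slowdown over a single unreplicated node would be unsurprising, while the reported 10x (40 minutes vs 4) exceeds that, which is why the numbers look odd. A toy calculation (the 2 GB figure is from the thread; everything else here is illustrative and not from the discussion):

```java
// Toy write-amplification arithmetic for the numbers in this thread.
public class WriteAmplification {
    // Bytes physically written for a logical payload at a given HDFS
    // replication factor (ignores WAL and compaction overhead).
    static long physicalBytes(long logicalBytes, int replication) {
        return logicalBytes * replication;
    }

    public static void main(String[] args) {
        long twoGiB = 2L * 1024 * 1024 * 1024;
        long written = physicalBytes(twoGiB, 3);
        System.out.println(written / (1024L * 1024 * 1024) + " GiB written");
        // Replication alone predicts ~3x the I/O, not the 10x slowdown
        // reported (40 min vs 4 min), so something else is likely wrong.
    }
}
```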