Posted to user@hbase.apache.org by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de> on 2011/03/09 10:57:44 UTC

Re: Job is faster without a cluster than with a 4-node cluster

Hi,
I don't have problems with the map tasks; my problems are with the
reduce tasks. If I start my job, the map tasks finish very quickly, but
the reduce tasks are very, very slow. The reduce tasks are the ones
importing my data.
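
Roughly, the reduce side is wired up like this (a simplified sketch only, not my exact
job code; the table, column family and qualifier names are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Reduce task that writes every record into HBase as a Put.
// The driver would hook it up with something like:
//   TableMapReduceUtil.initTableReducerJob("import_table", ImportReducer.class, job);
public class ImportReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
  private static final byte[] FAMILY = Bytes.toBytes("d");     // made-up family
  private static final byte[] QUALIFIER = Bytes.toBytes("v");  // made-up qualifier

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // One Put per record; each Put goes to whichever region server hosts that row.
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(FAMILY, QUALIFIER, Bytes.toBytes(value.toString()));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }
}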

Here is what I see on my web interface:

Region Servers
Address             Start Code      Load
slave1.local:60030  1298907779615   requests=0, regions=16, usedHeap=240, maxHeap=8183
slave2.local:60030  1298907780330   requests=0, regions=16, usedHeap=307, maxHeap=8183
slave3.local:60030  1298907778882   requests=0, regions=15, usedHeap=246, maxHeap=8183
slave4.local:60030  1298907780059   requests=0, regions=16, usedHeap=413, maxHeap=8183
Total:  servers: 4   requests=0, regions=63


On Mon, Feb 28, 2011 at 6:00 AM, Cavus,M.,Fa. Post Direkt
<M....@...> wrote:
> I have a simple job. It imports 2 GB of data into HBase in 4 minutes
> with Hadoop in standalone mode (no cluster).
>
> If I configure fully distributed mode, it takes 40 minutes to import
> the same 2 GB of data into my 4-node cluster.
>

So, running a mapreduce job when all is in standalone mode takes 4
minutes but distributed it's 40 minutes?  That sounds a bit odd.  Can
you tell what is going on for those 40 minutes?  How many maptasks?  How
many hbase regions?  Is it actually doing anything during this time?

St.Ack


Re: Many smaller tables vs one large table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Well, those regions are still distributed. So, depending on the amount
of data you generate per day, you may have only one region per day, but
even if you were using a single table and storing those rows next to
each other, the access pattern would stay the same, no?

J-D

On Wed, Mar 9, 2011 at 3:37 PM, Peter Haidinyak <ph...@local.com> wrote:
> I do that now, but if they were in different tables I could thread that out with one thread per table. I'm just worried I'd lose the advantage of HBase and a distributed system if the table ends up on one region server.
>
> -Pete
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
> Sent: Wednesday, March 09, 2011 3:14 PM
> To: user@hbase.apache.org
> Subject: Re: Many smaller tables vs one large table
>
> I guess it could be a good idea... do you need to be able to scan for
> data that's contained in more than one day?
>
> J-D
>
> On Wed, Mar 9, 2011 at 2:08 PM, Peter Haidinyak <ph...@local.com> wrote:
>> Hi all,
>>    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention on tables between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables will end up being rather small and will most likely end up on one region server.
>>        Long story short, is this a good idea?
>>
>> Thanks
>>
>> -Pete
>>
>

RE: Many smaller tables vs one large table

Posted by Peter Haidinyak <ph...@local.com>.
I do that now, but if they were in different tables I could thread that out with one thread per table. I'm just worried I'd lose the advantage of HBase and a distributed system if the table ends up on one region server.
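
Roughly what I have in mind, as a sketch only (the table names are made up, and this
isn't code we run today): one worker per daily table, each with its own HTable
instance, since HTable isn't thread-safe.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class PerDayTableScan {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    // One aggregation table per day (hypothetical names).
    List<String> dayTables = Arrays.asList("agg_20110307", "agg_20110308", "agg_20110309");

    ExecutorService pool = Executors.newFixedThreadPool(dayTables.size());
    List<Future<Long>> counts = new ArrayList<Future<Long>>();
    for (final String tableName : dayTables) {
      counts.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          // HTable is not thread-safe, so each worker opens its own instance.
          HTable table = new HTable(conf, tableName);
          ResultScanner scanner = table.getScanner(new Scan());
          long rows = 0;
          for (Result r : scanner) {
            rows++; // process one day's aggregated rows here
          }
          scanner.close();
          table.close();
          return rows;
        }
      }));
    }
    for (Future<Long> count : counts) {
      System.out.println(count.get() + " rows scanned");
    }
    pool.shutdown();
  }
}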

-Pete

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Wednesday, March 09, 2011 3:14 PM
To: user@hbase.apache.org
Subject: Re: Many smaller tables vs one large table

I guess it could be a good idea... do you need to be able to scan for
data that's contained in more than one day?

J-D

On Wed, Mar 9, 2011 at 2:08 PM, Peter Haidinyak <ph...@local.com> wrote:
> Hi all,
>    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention on tables between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables will end up being rather small and will most likely end up on one region server.
>        Long story short, is this a good idea?
>
> Thanks
>
> -Pete
>

Re: Many smaller tables vs one large table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I guess it could be a good idea... do you need to be able to scan for
data that's contained in more than one day?

J-D

On Wed, Mar 9, 2011 at 2:08 PM, Peter Haidinyak <ph...@local.com> wrote:
> Hi all,
>    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention on tables between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables will end up being rather small and will most likely end up on one region server.
>        Long story short, is this a good idea?
>
> Thanks
>
> -Pete
>

Many smaller tables vs one large table

Posted by Peter Haidinyak <ph...@local.com>.
Hi all,
    Right now I am aggregating our log data and populating tables based on how we want to query the data later. Currently I have eleven different aggregation tables, and the date is part of the row key. Since we usually slice our data by day, I was wondering if it would be better to create aggregation tables by date. I would no longer have to use the date as part of the start/stop row keys in a scan, and it would be easier to prune old data. I would also guess there would be less contention on tables between the process that populates a table and the processes that query it. One of the only problems I see, with my limited knowledge of HBase, is that the tables will end up being rather small and will most likely end up on one region server.
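
To make the comparison concrete, the two access patterns would look roughly like this
(a sketch only; the table names, date format and key layout are made up, not our real
schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DailySliceExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Today: one big table, date as the row-key prefix, so one day is a
    // bounded scan between a start row and a stop row.
    HTable bigTable = new HTable(conf, "search_agg");
    Scan oneDay = new Scan(Bytes.toBytes("20110309"), Bytes.toBytes("20110310"));
    ResultScanner bounded = bigTable.getScanner(oneDay);
    for (Result row : bounded) {
      // process one day's rows
    }
    bounded.close();
    bigTable.close();

    // Alternative: one table per day, scanned end to end; pruning old data
    // becomes "disable and drop yesterday's table".
    HTable dayTable = new HTable(conf, "search_agg_20110309");
    ResultScanner all = dayTable.getScanner(new Scan());
    for (Result row : all) {
      // process every row in the daily table
    }
    all.close();
    dayTable.close();
  }
}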
	Long story short, is this a good idea?

Thanks

-Pete

RE: Job is faster without a cluster than with a 4-node cluster

Posted by "Cavus,M.,Fa. Post Direkt" <M....@postdirekt.de>.
"Do the maptasks go to hbase?"
No, they don't. Only the reduce tasks go to HBase.

" At what point in the reduce do you see the above?  IIRC only at the
post 66% does reduce start entering data into hbase.  If you are
beyond this point, something else is up.  Try thread dumping a child
task or adding debugging to try narrow in on where the holdup."

I see this until 66%. After that I see only a few requests. My problem is that the job is slower with the 4-node cluster than with no cluster.

Regards
Musa

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, March 09, 2011 9:50 PM
To: user@hbase.apache.org
Cc: Cavus,M.,Fa. Post Direkt
Subject: Re: Job is faster without a cluster than with a 4-node cluster

On Wed, Mar 9, 2011 at 1:57 AM, Cavus,M.,Fa. Post Direkt
<M....@postdirekt.de> wrote:
> Hi,
> I don't have problems with the map tasks.

Do the maptasks go to hbase?

> Here is what I see on my web interface:
>
> Region Servers
> Address             Start Code      Load
> slave1.local:60030  1298907779615   requests=0, regions=16, usedHeap=240, maxHeap=8183
> slave2.local:60030  1298907780330   requests=0, regions=16, usedHeap=307, maxHeap=8183
> slave3.local:60030  1298907778882   requests=0, regions=15, usedHeap=246, maxHeap=8183
> slave4.local:60030  1298907780059   requests=0, regions=16, usedHeap=413, maxHeap=8183
> Total:  servers: 4   requests=0, regions=63
>

At what point in the reduce do you see the above?  IIRC it is only past
the 66% mark (once the copy and sort phases are done) that the reduce
starts entering data into hbase.  If you are beyond this point,
something else is up.  Try thread dumping a child task or adding
debugging to try to narrow in on where the holdup is.

St.Ack

Re: Job is faster without a cluster than with a 4-node cluster

Posted by Stack <st...@duboce.net>.
On Wed, Mar 9, 2011 at 1:57 AM, Cavus,M.,Fa. Post Direkt
<M....@postdirekt.de> wrote:
> Hi,
> I don't have problems with the map tasks.

Do the maptasks go to hbase?

> Here is what I see on my web interface:
>
> Region Servers
> Address             Start Code      Load
> slave1.local:60030  1298907779615   requests=0, regions=16, usedHeap=240, maxHeap=8183
> slave2.local:60030  1298907780330   requests=0, regions=16, usedHeap=307, maxHeap=8183
> slave3.local:60030  1298907778882   requests=0, regions=15, usedHeap=246, maxHeap=8183
> slave4.local:60030  1298907780059   requests=0, regions=16, usedHeap=413, maxHeap=8183
> Total:  servers: 4   requests=0, regions=63
>

At what point in the reduce do you see the above?  IIRC it is only past
the 66% mark (once the copy and sort phases are done) that the reduce
starts entering data into hbase.  If you are beyond this point,
something else is up.  Try thread dumping a child task or adding
debugging to try to narrow in on where the holdup is.
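
For example, something as crude as a counter and a timer around the write, along these
lines (a throwaway sketch against the stock TableReducer API, not your job's actual
code), would show in the job counters whether the time is actually going into the
HBase puts:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Throwaway instrumentation: count the Puts and the milliseconds spent writing them,
// so the job's counter page shows whether the time goes into the HBase writes or
// somewhere else.  Column family/qualifier names here are made up.
public class InstrumentedReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(value.toString()));
      long start = System.nanoTime();
      context.write(new ImmutableBytesWritable(put.getRow()), put);
      context.getCounter("import", "puts").increment(1);
      context.getCounter("import", "put_millis")
             .increment((System.nanoTime() - start) / 1000000L);
    }
  }
}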

St.Ack