Posted to dev@hbase.apache.org by Li Li <fa...@gmail.com> on 2014/07/22 04:30:02 UTC

Fwd: how to do parallel scanning in map reduce using hbase as input?

Could anyone help? The table now has about 1.1 billion rows, and it takes 2
hours to finish a map reduce job.

---------- Forwarded message ----------
From: Li Li <fa...@gmail.com>
Date: Thu, Jun 26, 2014 at 3:34 PM
Subject: how to do parallel scanning in map reduce using hbase as input?
To: user@hbase.apache.org


My table has about 700 million rows and about 80 regions. Each task
tracker is configured to run 4 mappers and 4 reducers at the same time.
The hadoop/hbase cluster has 5 nodes, so at any one time 20 mappers are
running, yet it takes more than an hour to finish the mapper stage and
the hbase cluster's load is very low, about 2,000 requests per second.
I think one mapper per region is too little parallelism. How can I run
more than one mapper per region so that the job takes full advantage of
the computing resources?
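
A minimal sketch of one way to do this with the stock client (before the
improvements mentioned later in this thread land): extend TableInputFormat and
chop each per-region split into several row-key sub-ranges, so the scheduler
can run more than one map task against the same region. The class name and
PIECES_PER_REGION are made up for illustration, and the TableSplit constructor
and Bytes.split signatures are as of roughly the 0.94-0.96 client, so check
them against your version before relying on this.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    public class SubdividingTableInputFormat extends TableInputFormat {
      private static final int PIECES_PER_REGION = 4;  // tune to taste

      @Override
      public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> perRegion = super.getSplits(context);  // stock behaviour: one split per region
        List<InputSplit> finer = new ArrayList<InputSplit>();
        for (InputSplit s : perRegion) {
          TableSplit ts = (TableSplit) s;
          byte[] start = ts.getStartRow();
          byte[] end = ts.getEndRow();
          // The first and last regions have empty boundary keys; leave those splits whole.
          if (start.length == 0 || end.length == 0) {
            finer.add(ts);
            continue;
          }
          // Bytes.split returns the two endpoints plus the requested intermediate keys.
          byte[][] keys = Bytes.split(start, end, PIECES_PER_REGION - 1);
          if (keys == null) {  // key range too narrow to subdivide
            finer.add(ts);
            continue;
          }
          for (int i = 0; i < keys.length - 1; i++) {
            finer.add(new TableSplit(ts.getTableName(), keys[i], keys[i + 1],
                ts.getRegionLocation()));
          }
        }
        return finer;
      }
    }

Wire it in with job.setInputFormatClass(SubdividingTableInputFormat.class)
after the usual TableMapReduceUtil.initTableMapperJob(...) call; the stock
record reader already restricts each task's scan to its split's start and stop
rows, so nothing else should need to change.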

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
That's great. What's the status of these issues? Is any patch available now?

On Fri, Jul 25, 2014 at 4:23 AM, Vladimir Rodionov
<vl...@gmail.com> wrote:
> I am working on improving inter-region scan performance and have the patch
> already. The patch will be committed as soon as all tests are done. This
> should improve M/R over HBase performance because now you will be able to
> create input splits with granularities lower than a region without loss of
> a performance.
>
> See :
>
> https://issues.apache.org/jira/browse/HBASE-7336
> https://issues.apache.org/jira/browse/HBASE-5979
>
> for more information on the subject.
>
> -Vladimir Rodionov
>
>
>
> On Tue, Jul 22, 2014 at 3:31 PM, Stack <st...@duboce.net> wrote:
>
>> On Mon, Jul 21, 2014 at 11:11 PM, Li Li <fa...@gmail.com> wrote:
>>
>> > On Tue, Jul 22, 2014 at 1:57 PM, Stack <st...@duboce.net> wrote:
>> > > On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fa...@gmail.com> wrote:
>> > >
>> > >> Sorry, I enter tab and it send my unfinished post. See the following
>> > >> mail for answers of other questions.
>> > >>
>> > >> I forget the exception's detail. It throws exception in terminal.
>> > >
>> > >
>> > > What exception is thrown?
>> > I forget it. maybe I can retry it with 8 mapper configuration. it
>> > seems like out of memory exception
>> >
>>
>>
>> Who OOME'd?  The map task or hbase?
>>
>>
>>
>> > >
>> > >
>> > >
>> > >> The
>> > >> default io.sort.mb is 100 and I set it to 500 to speed up reducer.
>> > >
>> > >
>> > > Do you have to have a reducer?  If you could skip the shuffle...
>> > I have 8 reducers
>> >
>>
>>
>> Do you have to reduce?
>>
>> Would more reducers make your job run faster?
>>
>>
>>
>> > >
>> > >
>> > >
>> > >> So
>> > >> I set mapred.child.java.opts to 1g
>> > >> The datanode/regionserver has 16GB memory but free memory
>> > >
>> > >
>> > > Does the RS use the 16G?
>> > the RS use 8G and there are datanode and tasktracker in this machine
>> > >
>> >
>>
>>
>> How much for DN and TT?  They don't need much usually.
>>
>>
>>
>> > >
>> > >
>> > >> for
>> > >> map-reduce is about 5gb. So I can't add more mappers
>> > >>
>> > >>
>> > >> How much RAM in these machines?
>> > 16GB
>>
>>
>>
>> These your machines or EC2?  Can you get bigger machines if EC2?
>>
>> St.Ack
>>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Vladimir Rodionov <vl...@gmail.com>.
I am working on improving inter-region scan performance and already have the
patch. It will be committed as soon as all tests are done. This should improve
the performance of M/R over HBase, because you will be able to create input
splits at granularities finer than a region without a loss of performance.

See :

https://issues.apache.org/jira/browse/HBASE-7336
https://issues.apache.org/jira/browse/HBASE-5979

for more information on the subject.

-Vladimir Rodionov



On Tue, Jul 22, 2014 at 3:31 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jul 21, 2014 at 11:11 PM, Li Li <fa...@gmail.com> wrote:
>
> > On Tue, Jul 22, 2014 at 1:57 PM, Stack <st...@duboce.net> wrote:
> > > On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fa...@gmail.com> wrote:
> > >
> > >> Sorry, I enter tab and it send my unfinished post. See the following
> > >> mail for answers of other questions.
> > >>
> > >> I forget the exception's detail. It throws exception in terminal.
> > >
> > >
> > > What exception is thrown?
> > I forget it. maybe I can retry it with 8 mapper configuration. it
> > seems like out of memory exception
> >
>
>
> Who OOME'd?  The map task or hbase?
>
>
>
> > >
> > >
> > >
> > >> The
> > >> default io.sort.mb is 100 and I set it to 500 to speed up reducer.
> > >
> > >
> > > Do you have to have a reducer?  If you could skip the shuffle...
> > I have 8 reducers
> >
>
>
> Do you have to reduce?
>
> Would more reducers make your job run faster?
>
>
>
> > >
> > >
> > >
> > >> So
> > >> I set mapred.child.java.opts to 1g
> > >> The datanode/regionserver has 16GB memory but free memory
> > >
> > >
> > > Does the RS use the 16G?
> > the RS use 8G and there are datanode and tasktracker in this machine
> > >
> >
>
>
> How much for DN and TT?  They don't need much usually.
>
>
>
> > >
> > >
> > >> for
> > >> map-reduce is about 5gb. So I can't add more mappers
> > >>
> > >>
> > >> How much RAM in these machines?
> > 16GB
>
>
>
> These your machines or EC2?  Can you get bigger machines if EC2?
>
> St.Ack
>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
On Wed, Jul 23, 2014 at 6:31 AM, Stack <st...@duboce.net> wrote:
> On Mon, Jul 21, 2014 at 11:11 PM, Li Li <fa...@gmail.com> wrote:
>
>> On Tue, Jul 22, 2014 at 1:57 PM, Stack <st...@duboce.net> wrote:
>> > On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> Sorry, I enter tab and it send my unfinished post. See the following
>> >> mail for answers of other questions.
>> >>
>> >> I forget the exception's detail. It throws exception in terminal.
>> >
>> >
>> > What exception is thrown?
>> I forget it. maybe I can retry it with 8 mapper configuration. it
>> seems like out of memory exception
>>
>
>
> Who OOME'd?  The map task or hbase?
map task
>
>
>
>> >
>> >
>> >
>> >> The
>> >> default io.sort.mb is 100 and I set it to 500 to speed up reducer.
>> >
>> >
>> > Do you have to have a reducer?  If you could skip the shuffle...
>> I have 8 reducers
>>
>
>
> Do you have to reduce?
yes, I have to run a reduce task
>
> Would more reducers make your job run faster?
>
>
>
>> >
>> >
>> >
>> >> So
>> >> I set mapred.child.java.opts to 1g
>> >> The datanode/regionserver has 16GB memory but free memory
>> >
>> >
>> > Does the RS use the 16G?
>> the RS use 8G and there are datanode and tasktracker in this machine
>> >
>>
>
>
> How much for DN and TT?  They don't need much usually.
>
>
>
>> >
>> >
>> >> for
>> >> map-reduce is about 5gb. So I can't add more mappers
>> >>
>> >>
>> >> How much RAM in these machines?
>> 16GB
>
>
>
> These your machines or EC2?  Can you get bigger machines if EC2?
>
> St.Ack

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Stack <st...@duboce.net>.
On Mon, Jul 21, 2014 at 11:11 PM, Li Li <fa...@gmail.com> wrote:

> On Tue, Jul 22, 2014 at 1:57 PM, Stack <st...@duboce.net> wrote:
> > On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> Sorry, I enter tab and it send my unfinished post. See the following
> >> mail for answers of other questions.
> >>
> >> I forget the exception's detail. It throws exception in terminal.
> >
> >
> > What exception is thrown?
> I forget it. maybe I can retry it with 8 mapper configuration. it
> seems like out of memory exception
>


Who OOME'd?  The map task or hbase?



> >
> >
> >
> >> The
> >> default io.sort.mb is 100 and I set it to 500 to speed up reducer.
> >
> >
> > Do you have to have a reducer?  If you could skip the shuffle...
> I have 8 reducers
>


Do you have to reduce?

Would more reducers make your job run faster?



> >
> >
> >
> >> So
> >> I set mapred.child.java.opts to 1g
> >> The datanode/regionserver has 16GB memory but free memory
> >
> >
> > Does the RS use the 16G?
> the RS use 8G and there are datanode and tasktracker in this machine
> >
>


How much for DN and TT?  They don't need much usually.



> >
> >
> >> for
> >> map-reduce is about 5gb. So I can't add more mappers
> >>
> >>
> >> How much RAM in these machines?
> 16GB



These your machines or EC2?  Can you get bigger machines if EC2?

St.Ack

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
On Tue, Jul 22, 2014 at 1:57 PM, Stack <st...@duboce.net> wrote:
> On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fa...@gmail.com> wrote:
>
>> Sorry, I enter tab and it send my unfinished post. See the following
>> mail for answers of other questions.
>>
>> I forget the exception's detail. It throws exception in terminal.
>
>
> What exception is thrown?
I forget it; maybe I can retry it with the 8-mapper configuration. It
seemed like an out-of-memory exception.
>
>
>
>> The
>> default io.sort.mb is 100 and I set it to 500 to speed up reducer.
>
>
> Do you have to have a reducer?  If you could skip the shuffle...
I have 8 reducers
>
>
>
>> So
>> I set mapred.child.java.opts to 1g
>> The datanode/regionserver has 16GB memory but free memory
>
>
> Does the RS use the 16G?
the RS uses 8G, and the datanode and tasktracker also run on this machine
>
>
>
>> for
>> map-reduce is about 5gb. So I can't add more mappers
>>
>>
>> How much RAM in these machines?
16GB
> St.Ack
>
>
>
>>
>> On Tue, Jul 22, 2014 at 1:37 PM, Stack <st...@duboce.net> wrote:
>> > On Mon, Jul 21, 2014 at 10:32 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> 1. yes, I have 20 concurrent running mappers.
>> >> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
>> >> set 8 mappers, it hit oov exception and load average is high
>> >>
>> >
>> >
>> > What is OOV?
>> >
>> > Do you have to have a reducer?
>> >
>> > Load average is high?  How high?
>> >
>> >
>> >
>> >> 3. fast mapper only use 1 minute. following is the statistics
>> >>
>> >
>> >
>> > So each region is only taking 1 minute to scan?  1.4Gs scanned?
>> >
>> > Can you add other counters to your MR job so we can get more of an idea
>> of
>> > what is going on in it?
>> >
>> > Please answer my other questions.
>> >
>> > Thanks,
>> > St.Ack
>> >
>> >
>> >> HBase Counters
>> >> REMOTE_RPC_CALLS 0
>> >> RPC_CALLS 523
>> >> RPC_RETRIES 0
>> >> NOT_SERVING_REGION_EXCEPTION 0
>> >> NUM_SCANNER_RESTARTS 0
>> >> MILLIS_BETWEEN_NEXTS 62,415
>> >> BYTES_IN_RESULTS 1,380,694,667
>> >> BYTES_IN_REMOTE_RESULTS 0
>> >> REGIONS_SCANNED 1
>> >> REMOTE_RPC_RETRIES 0
>> >>
>> >> FileSystemCounters
>> >> FILE_BYTES_READ 120,508,552
>> >> HDFS_BYTES_READ 176
>> >> FILE_BYTES_WRITTEN 241,000,600
>> >>
>> >> File Input Format Counters
>> >> Bytes Read 0
>> >>
>> >> Map-Reduce Framework
>> >> Map output materialized bytes 120,448,992
>> >> Combine output records 0
>> >> Map input records 5,208,607
>> >> Physical memory (bytes) snapshot 965,730,304
>> >> Spilled Records 10,417,214
>> >> Map output bytes 282,122,973
>> >> CPU time spent (ms) 82,610
>> >> Total committed heap usage (bytes) 1,061,158,912
>> >> Virtual memory (bytes) snapshot 1,681,047,552
>> >> Combine input records 0
>> >> Map output records 5,208,607
>> >> SPLIT_RAW_BYTES 176
>> >>
>> >>
>> >> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
>> >> > How many regions now?
>> >> >
>> >> > You still have 20 concurrent mappers running?  Are your machines
>> loaded
>> >> w/
>> >> > 4 map tasks on each?  Can you up the number of concurrent mappers?
>>  Can
>> >> you
>> >> > get an idea of your scan rates?  Are all map tasks scanning at same
>> rate?
>> >> >  Does one task lag the others?  Do you emit stats on each map task
>> such
>> >> as
>> >> > rows processed? Can you figure your bottleneck? Are you seeking disk
>> all
>> >> > the time?  Anything else running while this big scan is going on?  How
>> >> big
>> >> > are your cells?  Do you have one or more column families?  How many
>> >> columns?
>> >> >
>> >> > For average region size, do du on the hdfs region directories and then
>> >> sum
>> >> > and divide by region count.
>> >> >
>> >> > St.Ack
>> >> >
>> >> >
>> >> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
>> >> >
>> >> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
>> >> >> hours to finish a map reduce job.
>> >> >>
>> >> >> ---------- Forwarded message ----------
>> >> >> From: Li Li <fa...@gmail.com>
>> >> >> Date: Thu, Jun 26, 2014 at 3:34 PM
>> >> >> Subject: how to do parallel scanning in map reduce using hbase as
>> input?
>> >> >> To: user@hbase.apache.org
>> >> >>
>> >> >>
>> >> >> my table has about 700 million rows and about 80 regions. each task
>> >> >> tracker is configured with 4 mappers and 4 reducers at the same time.
>> >> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
>> >> >> mappers running. it takes more than an hour to finish mapper stage.
>> >> >> The hbase cluster's load is very low, about 2,000 request per second.
>> >> >> I think one mapper for a region is too small. How can I run more than
>> >> >> one mapper for a region so that it can take full advantage of
>> >> >> computing resources?
>> >> >>
>> >>
>>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Stack <st...@duboce.net>.
On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fa...@gmail.com> wrote:

> Sorry, I enter tab and it send my unfinished post. See the following
> mail for answers of other questions.
>
> I forget the exception's detail. It throws exception in terminal.


What exception is thrown?



> The
> default io.sort.mb is 100 and I set it to 500 to speed up reducer.


Do you have to have a reducer?  If you could skip the shuffle...
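
A minimal sketch of what skipping the shuffle looks like in the job driver
(conf, scan, MyMapper, and the output classes are placeholders, and the usual
imports are assumed): with zero reduce tasks the map output is written
straight to HDFS, so the sort/spill/shuffle machinery, including the
io.sort.mb buffer, is never exercised.

    Job job = new Job(conf, "scan url_db");   // MR1-era Job construction
    TableMapReduceUtil.initTableMapperJob("url_db", scan, MyMapper.class,
        Text.class, Text.class, job);
    job.setNumReduceTasks(0);                 // map-only job: no shuffle, no reducers
    job.setOutputFormatClass(TextOutputFormat.class);  // or whatever sink the job needs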



> So
> I set mapred.child.java.opts to 1g
> The datanode/regionserver has 16GB memory but free memory


Does the RS use the 16G?



> for
> map-reduce is about 5gb. So I can't add more mappers
>
>
> How much RAM in these machines?
St.Ack



>
> On Tue, Jul 22, 2014 at 1:37 PM, Stack <st...@duboce.net> wrote:
> > On Mon, Jul 21, 2014 at 10:32 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> 1. yes, I have 20 concurrent running mappers.
> >> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> >> set 8 mappers, it hit oov exception and load average is high
> >>
> >
> >
> > What is OOV?
> >
> > Do you have to have a reducer?
> >
> > Load average is high?  How high?
> >
> >
> >
> >> 3. fast mapper only use 1 minute. following is the statistics
> >>
> >
> >
> > So each region is only taking 1 minute to scan?  1.4Gs scanned?
> >
> > Can you add other counters to your MR job so we can get more of an idea
> of
> > what is going on in it?
> >
> > Please answer my other questions.
> >
> > Thanks,
> > St.Ack
> >
> >
> >> HBase Counters
> >> REMOTE_RPC_CALLS 0
> >> RPC_CALLS 523
> >> RPC_RETRIES 0
> >> NOT_SERVING_REGION_EXCEPTION 0
> >> NUM_SCANNER_RESTARTS 0
> >> MILLIS_BETWEEN_NEXTS 62,415
> >> BYTES_IN_RESULTS 1,380,694,667
> >> BYTES_IN_REMOTE_RESULTS 0
> >> REGIONS_SCANNED 1
> >> REMOTE_RPC_RETRIES 0
> >>
> >> FileSystemCounters
> >> FILE_BYTES_READ 120,508,552
> >> HDFS_BYTES_READ 176
> >> FILE_BYTES_WRITTEN 241,000,600
> >>
> >> File Input Format Counters
> >> Bytes Read 0
> >>
> >> Map-Reduce Framework
> >> Map output materialized bytes 120,448,992
> >> Combine output records 0
> >> Map input records 5,208,607
> >> Physical memory (bytes) snapshot 965,730,304
> >> Spilled Records 10,417,214
> >> Map output bytes 282,122,973
> >> CPU time spent (ms) 82,610
> >> Total committed heap usage (bytes) 1,061,158,912
> >> Virtual memory (bytes) snapshot 1,681,047,552
> >> Combine input records 0
> >> Map output records 5,208,607
> >> SPLIT_RAW_BYTES 176
> >>
> >>
> >> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
> >> > How many regions now?
> >> >
> >> > You still have 20 concurrent mappers running?  Are your machines
> loaded
> >> w/
> >> > 4 map tasks on each?  Can you up the number of concurrent mappers?
>  Can
> >> you
> >> > get an idea of your scan rates?  Are all map tasks scanning at same
> rate?
> >> >  Does one task lag the others?  Do you emit stats on each map task
> such
> >> as
> >> > rows processed? Can you figure your bottleneck? Are you seeking disk
> all
> >> > the time?  Anything else running while this big scan is going on?  How
> >> big
> >> > are your cells?  Do you have one or more column families?  How many
> >> columns?
> >> >
> >> > For average region size, do du on the hdfs region directories and then
> >> sum
> >> > and divide by region count.
> >> >
> >> > St.Ack
> >> >
> >> >
> >> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
> >> >
> >> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >> >> hours to finish a map reduce job.
> >> >>
> >> >> ---------- Forwarded message ----------
> >> >> From: Li Li <fa...@gmail.com>
> >> >> Date: Thu, Jun 26, 2014 at 3:34 PM
> >> >> Subject: how to do parallel scanning in map reduce using hbase as
> input?
> >> >> To: user@hbase.apache.org
> >> >>
> >> >>
> >> >> my table has about 700 million rows and about 80 regions. each task
> >> >> tracker is configured with 4 mappers and 4 reducers at the same time.
> >> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >> >> mappers running. it takes more than an hour to finish mapper stage.
> >> >> The hbase cluster's load is very low, about 2,000 request per second.
> >> >> I think one mapper for a region is too small. How can I run more than
> >> >> one mapper for a region so that it can take full advantage of
> >> >> computing resources?
> >> >>
> >>
>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
Sorry, I hit tab and it sent my unfinished post. See the following
mail for answers to the other questions.

I forget the exception's details; it threw the exception in the terminal. The
default io.sort.mb is 100 and I set it to 500 to speed up the reducer, so
I set mapred.child.java.opts to 1g.
The datanode/regionserver machine has 16GB of memory, but the free memory left
for map-reduce is about 5GB, so I can't add more mappers.
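
A side note on the memory arithmetic here, as a rough sketch (MR1 property
names, values as discussed in this thread, usual imports assumed): io.sort.mb
is carved out of each map task's heap, so 4 map slots plus 4 reduce slots at
-Xmx1g already ask for roughly 8 GB on a node that only has about 5 GB to
spare for map-reduce.

    Configuration conf = HBaseConfiguration.create();
    conf.set("mapred.child.java.opts", "-Xmx1g");  // per-task heap (a job-level setting)
    conf.setInt("io.sort.mb", 100);                // map-side sort buffer, allocated inside that heap
    // The 4-map / 4-reduce slot counts come from
    // mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum,
    // which live in mapred-site.xml on each tasktracker rather than in the job.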



On Tue, Jul 22, 2014 at 1:37 PM, Stack <st...@duboce.net> wrote:
> On Mon, Jul 21, 2014 at 10:32 PM, Li Li <fa...@gmail.com> wrote:
>
>> 1. yes, I have 20 concurrent running mappers.
>> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
>> set 8 mappers, it hit oov exception and load average is high
>>
>
>
> What is OOV?
>
> Do you have to have a reducer?
>
> Load average is high?  How high?
>
>
>
>> 3. fast mapper only use 1 minute. following is the statistics
>>
>
>
> So each region is only taking 1 minute to scan?  1.4Gs scanned?
>
> Can you add other counters to your MR job so we can get more of an idea of
> what is going on in it?
>
> Please answer my other questions.
>
> Thanks,
> St.Ack
>
>
>> HBase Counters
>> REMOTE_RPC_CALLS 0
>> RPC_CALLS 523
>> RPC_RETRIES 0
>> NOT_SERVING_REGION_EXCEPTION 0
>> NUM_SCANNER_RESTARTS 0
>> MILLIS_BETWEEN_NEXTS 62,415
>> BYTES_IN_RESULTS 1,380,694,667
>> BYTES_IN_REMOTE_RESULTS 0
>> REGIONS_SCANNED 1
>> REMOTE_RPC_RETRIES 0
>>
>> FileSystemCounters
>> FILE_BYTES_READ 120,508,552
>> HDFS_BYTES_READ 176
>> FILE_BYTES_WRITTEN 241,000,600
>>
>> File Input Format Counters
>> Bytes Read 0
>>
>> Map-Reduce Framework
>> Map output materialized bytes 120,448,992
>> Combine output records 0
>> Map input records 5,208,607
>> Physical memory (bytes) snapshot 965,730,304
>> Spilled Records 10,417,214
>> Map output bytes 282,122,973
>> CPU time spent (ms) 82,610
>> Total committed heap usage (bytes) 1,061,158,912
>> Virtual memory (bytes) snapshot 1,681,047,552
>> Combine input records 0
>> Map output records 5,208,607
>> SPLIT_RAW_BYTES 176
>>
>>
>> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
>> > How many regions now?
>> >
>> > You still have 20 concurrent mappers running?  Are your machines loaded
>> w/
>> > 4 map tasks on each?  Can you up the number of concurrent mappers?  Can
>> you
>> > get an idea of your scan rates?  Are all map tasks scanning at same rate?
>> >  Does one task lag the others?  Do you emit stats on each map task such
>> as
>> > rows processed? Can you figure your bottleneck? Are you seeking disk all
>> > the time?  Anything else running while this big scan is going on?  How
>> big
>> > are your cells?  Do you have one or more column families?  How many
>> columns?
>> >
>> > For average region size, do du on the hdfs region directories and then
>> sum
>> > and divide by region count.
>> >
>> > St.Ack
>> >
>> >
>> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
>> >
>> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
>> >> hours to finish a map reduce job.
>> >>
>> >> ---------- Forwarded message ----------
>> >> From: Li Li <fa...@gmail.com>
>> >> Date: Thu, Jun 26, 2014 at 3:34 PM
>> >> Subject: how to do parallel scanning in map reduce using hbase as input?
>> >> To: user@hbase.apache.org
>> >>
>> >>
>> >> my table has about 700 million rows and about 80 regions. each task
>> >> tracker is configured with 4 mappers and 4 reducers at the same time.
>> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
>> >> mappers running. it takes more than an hour to finish mapper stage.
>> >> The hbase cluster's load is very low, about 2,000 request per second.
>> >> I think one mapper for a region is too small. How can I run more than
>> >> one mapper for a region so that it can take full advantage of
>> >> computing resources?
>> >>
>>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Stack <st...@duboce.net>.
On Mon, Jul 21, 2014 at 10:32 PM, Li Li <fa...@gmail.com> wrote:

> 1. yes, I have 20 concurrent running mappers.
> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> set 8 mappers, it hit oov exception and load average is high
>


What is OOV?

Do you have to have a reducer?

Load average is high?  How high?



> 3. fast mapper only use 1 minute. following is the statistics
>


So each region is only taking 1 minute to scan?  1.4Gs scanned?
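
(As a rough calibration from the counters quoted in this thread: the fast task
reports BYTES_IN_RESULTS of about 1.38 GB against MILLIS_BETWEEN_NEXTS of
about 62 seconds, i.e. on the order of 22 MB/s per mapper, while the slow task
elsewhere in the thread shows about 9.46 GB over about 907 seconds, roughly
10 MB/s.)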

Can you add other counters to your MR job so we can get more of an idea of
what is going on in it?

Please answer my other questions.

Thanks,
St.Ack


> HBase Counters
> REMOTE_RPC_CALLS 0
> RPC_CALLS 523
> RPC_RETRIES 0
> NOT_SERVING_REGION_EXCEPTION 0
> NUM_SCANNER_RESTARTS 0
> MILLIS_BETWEEN_NEXTS 62,415
> BYTES_IN_RESULTS 1,380,694,667
> BYTES_IN_REMOTE_RESULTS 0
> REGIONS_SCANNED 1
> REMOTE_RPC_RETRIES 0
>
> FileSystemCounters
> FILE_BYTES_READ 120,508,552
> HDFS_BYTES_READ 176
> FILE_BYTES_WRITTEN 241,000,600
>
> File Input Format Counters
> Bytes Read 0
>
> Map-Reduce Framework
> Map output materialized bytes 120,448,992
> Combine output records 0
> Map input records 5,208,607
> Physical memory (bytes) snapshot 965,730,304
> Spilled Records 10,417,214
> Map output bytes 282,122,973
> CPU time spent (ms) 82,610
> Total committed heap usage (bytes) 1,061,158,912
> Virtual memory (bytes) snapshot 1,681,047,552
> Combine input records 0
> Map output records 5,208,607
> SPLIT_RAW_BYTES 176
>
>
> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
> > How many regions now?
> >
> > You still have 20 concurrent mappers running?  Are your machines loaded
> w/
> > 4 map tasks on each?  Can you up the number of concurrent mappers?  Can
> you
> > get an idea of your scan rates?  Are all map tasks scanning at same rate?
> >  Does one task lag the others?  Do you emit stats on each map task such
> as
> > rows processed? Can you figure your bottleneck? Are you seeking disk all
> > the time?  Anything else running while this big scan is going on?  How
> big
> > are your cells?  Do you have one or more column families?  How many
> columns?
> >
> > For average region size, do du on the hdfs region directories and then
> sum
> > and divide by region count.
> >
> > St.Ack
> >
> >
> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >> hours to finish a map reduce job.
> >>
> >> ---------- Forwarded message ----------
> >> From: Li Li <fa...@gmail.com>
> >> Date: Thu, Jun 26, 2014 at 3:34 PM
> >> Subject: how to do parallel scanning in map reduce using hbase as input?
> >> To: user@hbase.apache.org
> >>
> >>
> >> my table has about 700 million rows and about 80 regions. each task
> >> tracker is configured with 4 mappers and 4 reducers at the same time.
> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >> mappers running. it takes more than an hour to finish mapper stage.
> >> The hbase cluster's load is very low, about 2,000 request per second.
> >> I think one mapper for a region is too small. How can I run more than
> >> one mapper for a region so that it can take full advantage of
> >> computing resources?
> >>
>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Stack <st...@duboce.net>.
On Mon, Jul 21, 2014 at 11:08 PM, Li Li <fa...@gmail.com> wrote:

> On Tue, Jul 22, 2014 at 1:54 PM, Stack <st...@duboce.net> wrote:
> > On Mon, Jul 21, 2014 at 10:47 PM, Li Li <fa...@gmail.com> wrote:
> >
> >> sorry. I have not finished it.
> >> 1. yes, I have 20 concurrent running mappers.
> >> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> >> set 8 mappers, it hit oov exception and load average is high
> >> 3. fast mapper only use 1 minute. following is the statistics
> >> HBase Counters
> >>   REMOTE_RPC_CALLS 0
> >>   RPC_CALLS 523
> >>   RPC_RETRIES 0
> >>   NOT_SERVING_REGION_EXCEPTION 0
> >>   NUM_SCANNER_RESTARTS 0
> >>   MILLIS_BETWEEN_NEXTS 62,415
> >>   BYTES_IN_RESULTS 1,380,694,667
> >>   BYTES_IN_REMOTE_RESULTS 0
> >>   REGIONS_SCANNED 1
> >>   REMOTE_RPC_RETRIES 0
> >>
> >> FileSystemCounters
> >>   FILE_BYTES_READ 120,508,552
> >>   HDFS_BYTES_READ 176
> >>   FILE_BYTES_WRITTEN 241,000,600
> >>
> >> File Input Format Counters
> >>   Bytes Read 0
> >>
> >> Map-Reduce Framework
> >>   Map output materialized bytes 120,448,992
> >>   Combine output records 0
> >>   Map input records 5,208,607
> >>   Physical memory (bytes) snapshot 965,730,304
> >>   Spilled Records 10,417,214
> >>   Map output bytes 282,122,973
> >>   CPU time spent (ms) 82,610
> >>   Total committed heap usage (bytes) 1,061,158,912
> >>   Virtual memory (bytes) snapshot 1,681,047,552
> >>   Combine input records 0
> >>   Map output records 5,208,607
> >>   SPLIT_RAW_BYTES 176
> >>
> >>  slow mapper cost 25 minutes
> >>
> >
> >
> > So some mappers take 1 minute and others take 25 minutes?
> yes
>


If you look at the job on the tail, is it waiting on a single mapper to
finish, or do all the mappers finish at around the same time?  If they all
finish together, then you need to work on either speeding up the scan or
upping the parallelism.  If skew is adding a long tail to your completion
time, get to know your key space better and do splits so the data is spread
evenly among the map tasks.
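
A minimal sketch of forcing such splits from the 0.96-era client API (the
table name comes from this thread, the split-point key is a made-up
placeholder, and conf plus the usual imports are assumed):

    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Split at an explicit row key so the resulting regions, and therefore
      // the map tasks fed by them, carry comparable row counts.
      admin.split(Bytes.toBytes("url_db"), Bytes.toBytes("com.example/"));
    } finally {
      admin.close();
    }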

I have asked a few times whether the reduce phase is necessary.  If you could
redo your job so it was not needed, that would save a bunch.



> >
> > Do the map tasks balance each other out as they run or are you waiting on
> > one to complete, a really big one?
> >
> sorry, the fatest mapper takes 6 minutes and the slowest mapper takes 45
> minutes
> for the fastest, map input records is 4,469,570
> for the slowest one, input records is 22,335,536
>

Yeah, but is the skew the reason your job takes too long to finish?




> >
> >
> >> HBase Counters
> >>   REMOTE_RPC_CALLS 0
> >>   RPC_CALLS 2,268
> >>   RPC_RETRIES 0
> >>   NOT_SERVING_REGION_EXCEPTION 0
> >>   NUM_SCANNER_RESTARTS 0
> >>   MILLIS_BETWEEN_NEXTS 907,402
> >>   BYTES_IN_RESULTS 9,459,568,932
> >>   BYTES_IN_REMOTE_RESULTS 0
> >>   REGIONS_SCANNED 1
> >>   REMOTE_RPC_RETRIES 0
> >>
> >> FileSystemCounters
> >>   FILE_BYTES_READ 2,274,832,004
> >>   HDFS_BYTES_READ 161
> >>   FILE_BYTES_WRITTEN 3,770,108,961
> >>
> >> File Input Format Counters
> >>   Bytes Read 0
> >>
> >> Map-Reduce Framework
> >>   Map output materialized bytes 1,495,451,997
> >>   Combine output records 0
> >>   Map input records 22,659,551
> >>   Physical memory (bytes) snapshot 976,842,752
> >>   Spilled Records 57,085,847
> >>   Map output bytes 3,348,373,811
> >>   CPU time spent (ms) 1,134,640
> >>   Total committed heap usage (bytes) 945,291,264
> >>   Virtual memory (bytes) snapshot 1,699,991,552
> >>   Combine input records 0
> >>   Map output records 22,644,687
> >>   SPLIT_RAW_BYTES 161
> >>
> >> 4. I have about 11 billion rows and it takes 1.3TB(hdfs usage) and the
> >> replication factor is 2
> >>
> >
> > Make it 3 to be safe?
> yes, but I have not enough disk. the total disk usage(including non
> hdfs) is about 60%
>

In general, you may need more machines in the mix if you want to meet your
requirement.  Or do you think the current hardware is underutilized?



> >
> >
> >
> >> 5. for block information,
> >> one column family file:
> >> Name Type Size Replication Block Size Modification Time Permission Owner
> >> Group
> >> b8297e0a415a4ddc811009e70aa30371 file 195.43 MB 2 64 MB 2014-07-22
> >> 10:16 rw-r--r-- hadoop supergroup
> >> dea1d498ec6d46ea84ad35ea6cc3cf6e file 5.12 GB 2 64 MB 2014-07-20 20:24
> >> rw-r--r-- hadoop supergroup
> >> ee01947bad6f450d89bd71be84d9d60a file 2.68 MB 2 64 MB 2014-07-22 13:18
> >> rw-r--r-- hadoop supergroup
> >>
> >> another example
> >> 1923bdcf47ed40879ec4a2f6d314167e file 729.43 MB 2 64 MB 2014-07-18
> >> 20:32 rw-r--r-- hadoop supergroup
> >> 532d56af4457492194c5336f1f1d8359 file 372.27 MB 2 64 MB 2014-07-21
> >> 20:55 rw-r--r-- hadoop supergroup
> >> 55e92aef7b754059be9fc7e4692832ec file 117.45 MB 2 64 MB 2014-07-22
> >> 13:19 rw-r--r-- hadoop supergroup
> >> c927509f280a4cb3bc5c6db2feea5c16 file 7.87 GB 2 64 MB 2014-07-12 06:55
> >> rw-r--r-- hadoop supergroup
> >
> >
> >
> >> 6. I have only one column family for this table
> >>
> >> 7. each row has less than 10 columns
> >>
> >>
> > Ok.
> >
> > Does each cell have many versions or just one?
> just one version
> >
>

But are you overwriting the data, so that even though you are fetching only
one version, there may be many versions present?



> >
> >
> >> 8. region info in web ui
> >> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> >> Storefile Size Index Size Bloom Size
> >> mphbase1,60020,1405730850512 46 103 126528m 126567mb 94993k 329266k
> >> mphbase2,60020,1405730850549 45 100 157746m 157789mb 117250k 432066k
> >> mphbase3,60020,1405730850546 46 46 53592m 53610mb 42858k 110748k
> >> mphbase4,60020,1405730850585 43 101 109790m 109827mb 83236k 295068k
> >> mphbase5,60020,1405730850652 41 81 89073m 89099mb 66622k 243354k
> >>
> >> 9. url_db has 84 regions
> >>
> >>
> > What version of HBase?  You've set scan caching to be a decent number?
> >  1000 or so (presuming cells are not massive)?
> 0.96.2-hadoop1, r1581096
> I have set cache to 10,000
>
>

Good.
St.Ack



> >
> > St.Ack
> >
> >
> >
> >> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <fa...@gmail.com> wrote:
> >> > 1. yes, I have 20 concurrent running mappers.
> >> > 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> >> > set 8 mappers, it hit oov exception and load average is high
> >> > 3. fast mapper only use 1 minute. following is the statistics
> >> > HBase Counters
> >> > REMOTE_RPC_CALLS 0
> >> > RPC_CALLS 523
> >> > RPC_RETRIES 0
> >> > NOT_SERVING_REGION_EXCEPTION 0
> >> > NUM_SCANNER_RESTARTS 0
> >> > MILLIS_BETWEEN_NEXTS 62,415
> >> > BYTES_IN_RESULTS 1,380,694,667
> >> > BYTES_IN_REMOTE_RESULTS 0
> >> > REGIONS_SCANNED 1
> >> > REMOTE_RPC_RETRIES 0
> >> >
> >> > FileSystemCounters
> >> > FILE_BYTES_READ 120,508,552
> >> > HDFS_BYTES_READ 176
> >> > FILE_BYTES_WRITTEN 241,000,600
> >> >
> >> > File Input Format Counters
> >> > Bytes Read 0
> >> >
> >> > Map-Reduce Framework
> >> > Map output materialized bytes 120,448,992
> >> > Combine output records 0
> >> > Map input records 5,208,607
> >> > Physical memory (bytes) snapshot 965,730,304
> >> > Spilled Records 10,417,214
> >> > Map output bytes 282,122,973
> >> > CPU time spent (ms) 82,610
> >> > Total committed heap usage (bytes) 1,061,158,912
> >> > Virtual memory (bytes) snapshot 1,681,047,552
> >> > Combine input records 0
> >> > Map output records 5,208,607
> >> > SPLIT_RAW_BYTES 176
> >> >
> >> >
> >> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
> >> >> How many regions now?
> >> >>
> >> >> You still have 20 concurrent mappers running?  Are your machines
> loaded
> >> w/
> >> >> 4 map tasks on each?  Can you up the number of concurrent mappers?
>  Can
> >> you
> >> >> get an idea of your scan rates?  Are all map tasks scanning at same
> >> rate?
> >> >>  Does one task lag the others?  Do you emit stats on each map task
> such
> >> as
> >> >> rows processed? Can you figure your bottleneck? Are you seeking disk
> all
> >> >> the time?  Anything else running while this big scan is going on?
>  How
> >> big
> >> >> are your cells?  Do you have one or more column families?  How many
> >> columns?
> >> >>
> >> >> For average region size, do du on the hdfs region directories and
> then
> >> sum
> >> >> and divide by region count.
> >> >>
> >> >> St.Ack
> >> >>
> >> >>
> >> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
> >> >>
> >> >>> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >> >>> hours to finish a map reduce job.
> >> >>>
> >> >>> ---------- Forwarded message ----------
> >> >>> From: Li Li <fa...@gmail.com>
> >> >>> Date: Thu, Jun 26, 2014 at 3:34 PM
> >> >>> Subject: how to do parallel scanning in map reduce using hbase as
> >> input?
> >> >>> To: user@hbase.apache.org
> >> >>>
> >> >>>
> >> >>> my table has about 700 million rows and about 80 regions. each task
> >> >>> tracker is configured with 4 mappers and 4 reducers at the same
> time.
> >> >>> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >> >>> mappers running. it takes more than an hour to finish mapper stage.
> >> >>> The hbase cluster's load is very low, about 2,000 request per
> second.
> >> >>> I think one mapper for a region is too small. How can I run more
> than
> >> >>> one mapper for a region so that it can take full advantage of
> >> >>> computing resources?
> >> >>>
> >>
>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
On Tue, Jul 22, 2014 at 1:54 PM, Stack <st...@duboce.net> wrote:
> On Mon, Jul 21, 2014 at 10:47 PM, Li Li <fa...@gmail.com> wrote:
>
>> sorry. I have not finished it.
>> 1. yes, I have 20 concurrent running mappers.
>> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
>> set 8 mappers, it hit oov exception and load average is high
>> 3. fast mapper only use 1 minute. following is the statistics
>> HBase Counters
>>   REMOTE_RPC_CALLS 0
>>   RPC_CALLS 523
>>   RPC_RETRIES 0
>>   NOT_SERVING_REGION_EXCEPTION 0
>>   NUM_SCANNER_RESTARTS 0
>>   MILLIS_BETWEEN_NEXTS 62,415
>>   BYTES_IN_RESULTS 1,380,694,667
>>   BYTES_IN_REMOTE_RESULTS 0
>>   REGIONS_SCANNED 1
>>   REMOTE_RPC_RETRIES 0
>>
>> FileSystemCounters
>>   FILE_BYTES_READ 120,508,552
>>   HDFS_BYTES_READ 176
>>   FILE_BYTES_WRITTEN 241,000,600
>>
>> File Input Format Counters
>>   Bytes Read 0
>>
>> Map-Reduce Framework
>>   Map output materialized bytes 120,448,992
>>   Combine output records 0
>>   Map input records 5,208,607
>>   Physical memory (bytes) snapshot 965,730,304
>>   Spilled Records 10,417,214
>>   Map output bytes 282,122,973
>>   CPU time spent (ms) 82,610
>>   Total committed heap usage (bytes) 1,061,158,912
>>   Virtual memory (bytes) snapshot 1,681,047,552
>>   Combine input records 0
>>   Map output records 5,208,607
>>   SPLIT_RAW_BYTES 176
>>
>>  slow mapper cost 25 minutes
>>
>
>
> So some mappers take 1 minute and others take 25 minutes?
yes
>
> Do the map tasks balance each other out as they run or are you waiting on
> one to complete, a really big one?
>
Sorry, the fastest mapper takes 6 minutes and the slowest mapper takes 45 minutes.
For the fastest, the map input record count is 4,469,570;
for the slowest one, it is 22,335,536.
>
>
>> HBase Counters
>>   REMOTE_RPC_CALLS 0
>>   RPC_CALLS 2,268
>>   RPC_RETRIES 0
>>   NOT_SERVING_REGION_EXCEPTION 0
>>   NUM_SCANNER_RESTARTS 0
>>   MILLIS_BETWEEN_NEXTS 907,402
>>   BYTES_IN_RESULTS 9,459,568,932
>>   BYTES_IN_REMOTE_RESULTS 0
>>   REGIONS_SCANNED 1
>>   REMOTE_RPC_RETRIES 0
>>
>> FileSystemCounters
>>   FILE_BYTES_READ 2,274,832,004
>>   HDFS_BYTES_READ 161
>>   FILE_BYTES_WRITTEN 3,770,108,961
>>
>> File Input Format Counters
>>   Bytes Read 0
>>
>> Map-Reduce Framework
>>   Map output materialized bytes 1,495,451,997
>>   Combine output records 0
>>   Map input records 22,659,551
>>   Physical memory (bytes) snapshot 976,842,752
>>   Spilled Records 57,085,847
>>   Map output bytes 3,348,373,811
>>   CPU time spent (ms) 1,134,640
>>   Total committed heap usage (bytes) 945,291,264
>>   Virtual memory (bytes) snapshot 1,699,991,552
>>   Combine input records 0
>>   Map output records 22,644,687
>>   SPLIT_RAW_BYTES 161
>>
>> 4. I have about 11 billion rows and it takes 1.3TB(hdfs usage) and the
>> replication factor is 2
>>
>
> Make it 3 to be safe?
Yes, but I don't have enough disk; the total disk usage (including non-HDFS
data) is about 60%.
>
>
>
>> 5. for block information,
>> one column family file:
>> Name Type Size Replication Block Size Modification Time Permission Owner
>> Group
>> b8297e0a415a4ddc811009e70aa30371 file 195.43 MB 2 64 MB 2014-07-22
>> 10:16 rw-r--r-- hadoop supergroup
>> dea1d498ec6d46ea84ad35ea6cc3cf6e file 5.12 GB 2 64 MB 2014-07-20 20:24
>> rw-r--r-- hadoop supergroup
>> ee01947bad6f450d89bd71be84d9d60a file 2.68 MB 2 64 MB 2014-07-22 13:18
>> rw-r--r-- hadoop supergroup
>>
>> another example
>> 1923bdcf47ed40879ec4a2f6d314167e file 729.43 MB 2 64 MB 2014-07-18
>> 20:32 rw-r--r-- hadoop supergroup
>> 532d56af4457492194c5336f1f1d8359 file 372.27 MB 2 64 MB 2014-07-21
>> 20:55 rw-r--r-- hadoop supergroup
>> 55e92aef7b754059be9fc7e4692832ec file 117.45 MB 2 64 MB 2014-07-22
>> 13:19 rw-r--r-- hadoop supergroup
>> c927509f280a4cb3bc5c6db2feea5c16 file 7.87 GB 2 64 MB 2014-07-12 06:55
>> rw-r--r-- hadoop supergroup
>
>
>
>> 6. I have only one column family for this table
>>
>> 7. each row has less than 10 columns
>>
>>
> Ok.
>
> Does each cell have many versions or just one?
just one version
>
>
>
>> 8. region info in web ui
>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
>> Storefile Size Index Size Bloom Size
>> mphbase1,60020,1405730850512 46 103 126528m 126567mb 94993k 329266k
>> mphbase2,60020,1405730850549 45 100 157746m 157789mb 117250k 432066k
>> mphbase3,60020,1405730850546 46 46 53592m 53610mb 42858k 110748k
>> mphbase4,60020,1405730850585 43 101 109790m 109827mb 83236k 295068k
>> mphbase5,60020,1405730850652 41 81 89073m 89099mb 66622k 243354k
>>
>> 9. url_db has 84 regions
>>
>>
> What version of HBase?  You've set scan caching to be a decent number?
>  1000 or so (presuming cells are not massive)?
0.96.2-hadoop1, r1581096
I have set cache to 10,000

>
> St.Ack
>
>
>
>> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <fa...@gmail.com> wrote:
>> > 1. yes, I have 20 concurrent running mappers.
>> > 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
>> > set 8 mappers, it hit oov exception and load average is high
>> > 3. fast mapper only use 1 minute. following is the statistics
>> > HBase Counters
>> > REMOTE_RPC_CALLS 0
>> > RPC_CALLS 523
>> > RPC_RETRIES 0
>> > NOT_SERVING_REGION_EXCEPTION 0
>> > NUM_SCANNER_RESTARTS 0
>> > MILLIS_BETWEEN_NEXTS 62,415
>> > BYTES_IN_RESULTS 1,380,694,667
>> > BYTES_IN_REMOTE_RESULTS 0
>> > REGIONS_SCANNED 1
>> > REMOTE_RPC_RETRIES 0
>> >
>> > FileSystemCounters
>> > FILE_BYTES_READ 120,508,552
>> > HDFS_BYTES_READ 176
>> > FILE_BYTES_WRITTEN 241,000,600
>> >
>> > File Input Format Counters
>> > Bytes Read 0
>> >
>> > Map-Reduce Framework
>> > Map output materialized bytes 120,448,992
>> > Combine output records 0
>> > Map input records 5,208,607
>> > Physical memory (bytes) snapshot 965,730,304
>> > Spilled Records 10,417,214
>> > Map output bytes 282,122,973
>> > CPU time spent (ms) 82,610
>> > Total committed heap usage (bytes) 1,061,158,912
>> > Virtual memory (bytes) snapshot 1,681,047,552
>> > Combine input records 0
>> > Map output records 5,208,607
>> > SPLIT_RAW_BYTES 176
>> >
>> >
>> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
>> >> How many regions now?
>> >>
>> >> You still have 20 concurrent mappers running?  Are your machines loaded
>> w/
>> >> 4 map tasks on each?  Can you up the number of concurrent mappers?  Can
>> you
>> >> get an idea of your scan rates?  Are all map tasks scanning at same
>> rate?
>> >>  Does one task lag the others?  Do you emit stats on each map task such
>> as
>> >> rows processed? Can you figure your bottleneck? Are you seeking disk all
>> >> the time?  Anything else running while this big scan is going on?  How
>> big
>> >> are your cells?  Do you have one or more column families?  How many
>> columns?
>> >>
>> >> For average region size, do du on the hdfs region directories and then
>> sum
>> >> and divide by region count.
>> >>
>> >> St.Ack
>> >>
>> >>
>> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
>> >>
>> >>> anyone could help? now I have about 1.1 billion nodes and it takes 2
>> >>> hours to finish a map reduce job.
>> >>>
>> >>> ---------- Forwarded message ----------
>> >>> From: Li Li <fa...@gmail.com>
>> >>> Date: Thu, Jun 26, 2014 at 3:34 PM
>> >>> Subject: how to do parallel scanning in map reduce using hbase as
>> input?
>> >>> To: user@hbase.apache.org
>> >>>
>> >>>
>> >>> my table has about 700 million rows and about 80 regions. each task
>> >>> tracker is configured with 4 mappers and 4 reducers at the same time.
>> >>> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
>> >>> mappers running. it takes more than an hour to finish mapper stage.
>> >>> The hbase cluster's load is very low, about 2,000 request per second.
>> >>> I think one mapper for a region is too small. How can I run more than
>> >>> one mapper for a region so that it can take full advantage of
>> >>> computing resources?
>> >>>
>>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Stack <st...@duboce.net>.
On Mon, Jul 21, 2014 at 10:47 PM, Li Li <fa...@gmail.com> wrote:

> sorry. I have not finished it.
> 1. yes, I have 20 concurrent running mappers.
> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> set 8 mappers, it hit oov exception and load average is high
> 3. fast mapper only use 1 minute. following is the statistics
> HBase Counters
>   REMOTE_RPC_CALLS 0
>   RPC_CALLS 523
>   RPC_RETRIES 0
>   NOT_SERVING_REGION_EXCEPTION 0
>   NUM_SCANNER_RESTARTS 0
>   MILLIS_BETWEEN_NEXTS 62,415
>   BYTES_IN_RESULTS 1,380,694,667
>   BYTES_IN_REMOTE_RESULTS 0
>   REGIONS_SCANNED 1
>   REMOTE_RPC_RETRIES 0
>
> FileSystemCounters
>   FILE_BYTES_READ 120,508,552
>   HDFS_BYTES_READ 176
>   FILE_BYTES_WRITTEN 241,000,600
>
> File Input Format Counters
>   Bytes Read 0
>
> Map-Reduce Framework
>   Map output materialized bytes 120,448,992
>   Combine output records 0
>   Map input records 5,208,607
>   Physical memory (bytes) snapshot 965,730,304
>   Spilled Records 10,417,214
>   Map output bytes 282,122,973
>   CPU time spent (ms) 82,610
>   Total committed heap usage (bytes) 1,061,158,912
>   Virtual memory (bytes) snapshot 1,681,047,552
>   Combine input records 0
>   Map output records 5,208,607
>   SPLIT_RAW_BYTES 176
>
>  slow mapper cost 25 minutes
>


So some mappers take 1 minute and others take 25 minutes?

Do the map tasks balance each other out as they run or are you waiting on
one to complete, a really big one?



> HBase Counters
>   REMOTE_RPC_CALLS 0
>   RPC_CALLS 2,268
>   RPC_RETRIES 0
>   NOT_SERVING_REGION_EXCEPTION 0
>   NUM_SCANNER_RESTARTS 0
>   MILLIS_BETWEEN_NEXTS 907,402
>   BYTES_IN_RESULTS 9,459,568,932
>   BYTES_IN_REMOTE_RESULTS 0
>   REGIONS_SCANNED 1
>   REMOTE_RPC_RETRIES 0
>
> FileSystemCounters
>   FILE_BYTES_READ 2,274,832,004
>   HDFS_BYTES_READ 161
>   FILE_BYTES_WRITTEN 3,770,108,961
>
> File Input Format Counters
>   Bytes Read 0
>
> Map-Reduce Framework
>   Map output materialized bytes 1,495,451,997
>   Combine output records 0
>   Map input records 22,659,551
>   Physical memory (bytes) snapshot 976,842,752
>   Spilled Records 57,085,847
>   Map output bytes 3,348,373,811
>   CPU time spent (ms) 1,134,640
>   Total committed heap usage (bytes) 945,291,264
>   Virtual memory (bytes) snapshot 1,699,991,552
>   Combine input records 0
>   Map output records 22,644,687
>   SPLIT_RAW_BYTES 161
>
> 4. I have about 11 billion rows and it takes 1.3TB(hdfs usage) and the
> replication factor is 2
>

Make it 3 to be safe?



> 5. for block information,
> one column family file:
> Name Type Size Replication Block Size Modification Time Permission Owner
> Group
> b8297e0a415a4ddc811009e70aa30371 file 195.43 MB 2 64 MB 2014-07-22
> 10:16 rw-r--r-- hadoop supergroup
> dea1d498ec6d46ea84ad35ea6cc3cf6e file 5.12 GB 2 64 MB 2014-07-20 20:24
> rw-r--r-- hadoop supergroup
> ee01947bad6f450d89bd71be84d9d60a file 2.68 MB 2 64 MB 2014-07-22 13:18
> rw-r--r-- hadoop supergroup
>
> another example
> 1923bdcf47ed40879ec4a2f6d314167e file 729.43 MB 2 64 MB 2014-07-18
> 20:32 rw-r--r-- hadoop supergroup
> 532d56af4457492194c5336f1f1d8359 file 372.27 MB 2 64 MB 2014-07-21
> 20:55 rw-r--r-- hadoop supergroup
> 55e92aef7b754059be9fc7e4692832ec file 117.45 MB 2 64 MB 2014-07-22
> 13:19 rw-r--r-- hadoop supergroup
> c927509f280a4cb3bc5c6db2feea5c16 file 7.87 GB 2 64 MB 2014-07-12 06:55
> rw-r--r-- hadoop supergroup



> 6. I have only one column family for this table
>
> 7. each row has less than 10 columns
>
>
Ok.

Does each cell have many versions or just one?



> 8. region info in web ui
> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
> Storefile Size Index Size Bloom Size
> mphbase1,60020,1405730850512 46 103 126528m 126567mb 94993k 329266k
> mphbase2,60020,1405730850549 45 100 157746m 157789mb 117250k 432066k
> mphbase3,60020,1405730850546 46 46 53592m 53610mb 42858k 110748k
> mphbase4,60020,1405730850585 43 101 109790m 109827mb 83236k 295068k
> mphbase5,60020,1405730850652 41 81 89073m 89099mb 66622k 243354k
>
> 9. url_db has 84 regions
>
>
What version of HBase?  You've set scan caching to be a decent number?
 1000 or so (presuming cells are not massive)?

St.Ack
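
For reference, a minimal sketch of the scan setup being asked about (job,
MyMapper, and the output classes are placeholders; setCaching, setCacheBlocks,
and initTableMapperJob are stock client calls):

    Scan scan = new Scan();
    scan.setCaching(1000);        // rows fetched per next() RPC; 10,000 can be heavy for fat rows
    scan.setCacheBlocks(false);   // a full-table scan should not churn the block cache
    TableMapReduceUtil.initTableMapperJob("url_db", scan, MyMapper.class,
        Text.class, Text.class, job);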



> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <fa...@gmail.com> wrote:
> > 1. yes, I have 20 concurrent running mappers.
> > 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> > set 8 mappers, it hit oov exception and load average is high
> > 3. fast mapper only use 1 minute. following is the statistics
> > HBase Counters
> > REMOTE_RPC_CALLS 0
> > RPC_CALLS 523
> > RPC_RETRIES 0
> > NOT_SERVING_REGION_EXCEPTION 0
> > NUM_SCANNER_RESTARTS 0
> > MILLIS_BETWEEN_NEXTS 62,415
> > BYTES_IN_RESULTS 1,380,694,667
> > BYTES_IN_REMOTE_RESULTS 0
> > REGIONS_SCANNED 1
> > REMOTE_RPC_RETRIES 0
> >
> > FileSystemCounters
> > FILE_BYTES_READ 120,508,552
> > HDFS_BYTES_READ 176
> > FILE_BYTES_WRITTEN 241,000,600
> >
> > File Input Format Counters
> > Bytes Read 0
> >
> > Map-Reduce Framework
> > Map output materialized bytes 120,448,992
> > Combine output records 0
> > Map input records 5,208,607
> > Physical memory (bytes) snapshot 965,730,304
> > Spilled Records 10,417,214
> > Map output bytes 282,122,973
> > CPU time spent (ms) 82,610
> > Total committed heap usage (bytes) 1,061,158,912
> > Virtual memory (bytes) snapshot 1,681,047,552
> > Combine input records 0
> > Map output records 5,208,607
> > SPLIT_RAW_BYTES 176
> >
> >
> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
> >> How many regions now?
> >>
> >> You still have 20 concurrent mappers running?  Are your machines loaded
> w/
> >> 4 map tasks on each?  Can you up the number of concurrent mappers?  Can
> you
> >> get an idea of your scan rates?  Are all map tasks scanning at same
> rate?
> >>  Does one task lag the others?  Do you emit stats on each map task such
> as
> >> rows processed? Can you figure your bottleneck? Are you seeking disk all
> >> the time?  Anything else running while this big scan is going on?  How
> big
> >> are your cells?  Do you have one or more column families?  How many
> columns?
> >>
> >> For average region size, do du on the hdfs region directories and then
> sum
> >> and divide by region count.
> >>
> >> St.Ack
> >>
> >>
> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
> >>
> >>> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >>> hours to finish a map reduce job.
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: Li Li <fa...@gmail.com>
> >>> Date: Thu, Jun 26, 2014 at 3:34 PM
> >>> Subject: how to do parallel scanning in map reduce using hbase as
> input?
> >>> To: user@hbase.apache.org
> >>>
> >>>
> >>> my table has about 700 million rows and about 80 regions. each task
> >>> tracker is configured with 4 mappers and 4 reducers at the same time.
> >>> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >>> mappers running. it takes more than an hour to finish mapper stage.
> >>> The hbase cluster's load is very low, about 2,000 request per second.
> >>> I think one mapper for a region is too small. How can I run more than
> >>> one mapper for a region so that it can take full advantage of
> >>> computing resources?
> >>>
>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
Sorry, I had not finished it.
1. Yes, I have 20 concurrently running mappers.
2. I can't add more mappers, because I set io.sort.mb to 500MB and if I
set 8 mappers it hits an oov exception and the load average is high.
3. A fast mapper uses only 1 minute; its statistics follow:
HBase Counters
  REMOTE_RPC_CALLS 0
  RPC_CALLS 523
  RPC_RETRIES 0
  NOT_SERVING_REGION_EXCEPTION 0
  NUM_SCANNER_RESTARTS 0
  MILLIS_BETWEEN_NEXTS 62,415
  BYTES_IN_RESULTS 1,380,694,667
  BYTES_IN_REMOTE_RESULTS 0
  REGIONS_SCANNED 1
  REMOTE_RPC_RETRIES 0

FileSystemCounters
  FILE_BYTES_READ 120,508,552
  HDFS_BYTES_READ 176
  FILE_BYTES_WRITTEN 241,000,600

File Input Format Counters
  Bytes Read 0

Map-Reduce Framework
  Map output materialized bytes 120,448,992
  Combine output records 0
  Map input records 5,208,607
  Physical memory (bytes) snapshot 965,730,304
  Spilled Records 10,417,214
  Map output bytes 282,122,973
  CPU time spent (ms) 82,610
  Total committed heap usage (bytes) 1,061,158,912
  Virtual memory (bytes) snapshot 1,681,047,552
  Combine input records 0
  Map output records 5,208,607
  SPLIT_RAW_BYTES 176

A slow mapper costs 25 minutes:
HBase Counters
  REMOTE_RPC_CALLS 0
  RPC_CALLS 2,268
  RPC_RETRIES 0
  NOT_SERVING_REGION_EXCEPTION 0
  NUM_SCANNER_RESTARTS 0
  MILLIS_BETWEEN_NEXTS 907,402
  BYTES_IN_RESULTS 9,459,568,932
  BYTES_IN_REMOTE_RESULTS 0
  REGIONS_SCANNED 1
  REMOTE_RPC_RETRIES 0

FileSystemCounters
  FILE_BYTES_READ 2,274,832,004
  HDFS_BYTES_READ 161
  FILE_BYTES_WRITTEN 3,770,108,961

File Input Format Counters
  Bytes Read 0

Map-Reduce Framework
  Map output materialized bytes 1,495,451,997
  Combine output records 0
  Map input records 22,659,551
  Physical memory (bytes) snapshot 976,842,752
  Spilled Records 57,085,847
  Map output bytes 3,348,373,811
  CPU time spent (ms) 1,134,640
  Total committed heap usage (bytes) 945,291,264
  Virtual memory (bytes) snapshot 1,699,991,552
  Combine input records 0
  Map output records 22,644,687
  SPLIT_RAW_BYTES 161

4. I have about 11 billion rows; they take 1.3TB (HDFS usage) and the
replication factor is 2
5. for block information,
one column family file:
Name Type Size Replication Block Size Modification Time Permission Owner Group
b8297e0a415a4ddc811009e70aa30371 file 195.43 MB 2 64 MB 2014-07-22
10:16 rw-r--r-- hadoop supergroup
dea1d498ec6d46ea84ad35ea6cc3cf6e file 5.12 GB 2 64 MB 2014-07-20 20:24
rw-r--r-- hadoop supergroup
ee01947bad6f450d89bd71be84d9d60a file 2.68 MB 2 64 MB 2014-07-22 13:18
rw-r--r-- hadoop supergroup

another example
1923bdcf47ed40879ec4a2f6d314167e file 729.43 MB 2 64 MB 2014-07-18
20:32 rw-r--r-- hadoop supergroup
532d56af4457492194c5336f1f1d8359 file 372.27 MB 2 64 MB 2014-07-21
20:55 rw-r--r-- hadoop supergroup
55e92aef7b754059be9fc7e4692832ec file 117.45 MB 2 64 MB 2014-07-22
13:19 rw-r--r-- hadoop supergroup
c927509f280a4cb3bc5c6db2feea5c16 file 7.87 GB 2 64 MB 2014-07-12 06:55
rw-r--r-- hadoop supergroup

6. I have only one column family for this table

7. each row has less than 10 columns

8. region info in web ui
ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed
Storefile Size Index Size Bloom Size
mphbase1,60020,1405730850512 46 103 126528m 126567mb 94993k 329266k
mphbase2,60020,1405730850549 45 100 157746m 157789mb 117250k 432066k
mphbase3,60020,1405730850546 46 46 53592m 53610mb 42858k 110748k
mphbase4,60020,1405730850585 43 101 109790m 109827mb 83236k 295068k
mphbase5,60020,1405730850652 41 81 89073m 89099mb 66622k 243354k

9. url_db has 84 regions
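
(From the figures above: the five region servers report roughly 126,528 +
157,746 + 53,592 + 109,790 + 89,073 MB, about 537 GB of store files, across
84 regions, i.e. an average of roughly 6.4 GB per region, which is roughly
consistent with the ~1.3 TB of HDFS usage at replication factor 2.)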

On Tue, Jul 22, 2014 at 1:32 PM, Li Li <fa...@gmail.com> wrote:
> 1. yes, I have 20 concurrent running mappers.
> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> set 8 mappers, it hit oov exception and load average is high
> 3. fast mapper only use 1 minute. following is the statistics
> HBase Counters
> REMOTE_RPC_CALLS 0
> RPC_CALLS 523
> RPC_RETRIES 0
> NOT_SERVING_REGION_EXCEPTION 0
> NUM_SCANNER_RESTARTS 0
> MILLIS_BETWEEN_NEXTS 62,415
> BYTES_IN_RESULTS 1,380,694,667
> BYTES_IN_REMOTE_RESULTS 0
> REGIONS_SCANNED 1
> REMOTE_RPC_RETRIES 0
>
> FileSystemCounters
> FILE_BYTES_READ 120,508,552
> HDFS_BYTES_READ 176
> FILE_BYTES_WRITTEN 241,000,600
>
> File Input Format Counters
> Bytes Read 0
>
> Map-Reduce Framework
> Map output materialized bytes 120,448,992
> Combine output records 0
> Map input records 5,208,607
> Physical memory (bytes) snapshot 965,730,304
> Spilled Records 10,417,214
> Map output bytes 282,122,973
> CPU time spent (ms) 82,610
> Total committed heap usage (bytes) 1,061,158,912
> Virtual memory (bytes) snapshot 1,681,047,552
> Combine input records 0
> Map output records 5,208,607
> SPLIT_RAW_BYTES 176
>
>
> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
>> How many regions now?
>>
>> You still have 20 concurrent mappers running?  Are your machines loaded w/
>> 4 map tasks on each?  Can you up the number of concurrent mappers?  Can you
>> get an idea of your scan rates?  Are all map tasks scanning at same rate?
>>  Does one task lag the others?  Do you emit stats on each map task such as
>> rows processed? Can you figure your bottleneck? Are you seeking disk all
>> the time?  Anything else running while this big scan is going on?  How big
>> are your cells?  Do you have one or more column families?  How many columns?
>>
>> For average region size, do du on the hdfs region directories and then sum
>> and divide by region count.
>>
>> St.Ack
>>
>>
>> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
>>
>>> anyone could help? now I have about 1.1 billion nodes and it takes 2
>>> hours to finish a map reduce job.
>>>
>>> ---------- Forwarded message ----------
>>> From: Li Li <fa...@gmail.com>
>>> Date: Thu, Jun 26, 2014 at 3:34 PM
>>> Subject: how to do parallel scanning in map reduce using hbase as input?
>>> To: user@hbase.apache.org
>>>
>>>
>>> my table has about 700 million rows and about 80 regions. each task
>>> tracker is configured with 4 mappers and 4 reducers at the same time.
>>> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
>>> mappers running. it takes more than an hour to finish mapper stage.
>>> The hbase cluster's load is very low, about 2,000 request per second.
>>> I think one mapper for a region is too small. How can I run more than
>>> one mapper for a region so that it can take full advantage of
>>> computing resources?
>>>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Li Li <fa...@gmail.com>.
1. Yes, I have 20 concurrently running mappers.
2. I can't add more mappers, because I set io.sort.mb to 500MB and if I
set 8 mappers it hits an oov exception and the load average is high.
3. A fast mapper uses only 1 minute; its statistics follow:
HBase Counters
REMOTE_RPC_CALLS 0
RPC_CALLS 523
RPC_RETRIES 0
NOT_SERVING_REGION_EXCEPTION 0
NUM_SCANNER_RESTARTS 0
MILLIS_BETWEEN_NEXTS 62,415
BYTES_IN_RESULTS 1,380,694,667
BYTES_IN_REMOTE_RESULTS 0
REGIONS_SCANNED 1
REMOTE_RPC_RETRIES 0

FileSystemCounters
FILE_BYTES_READ 120,508,552
HDFS_BYTES_READ 176
FILE_BYTES_WRITTEN 241,000,600

File Input Format Counters
Bytes Read 0

Map-Reduce Framework
Map output materialized bytes 120,448,992
Combine output records 0
Map input records 5,208,607
Physical memory (bytes) snapshot 965,730,304
Spilled Records 10,417,214
Map output bytes 282,122,973
CPU time spent (ms) 82,610
Total committed heap usage (bytes) 1,061,158,912
Virtual memory (bytes) snapshot 1,681,047,552
Combine input records 0
Map output records 5,208,607
SPLIT_RAW_BYTES 176


On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
> How many regions now?
>
> You still have 20 concurrent mappers running?  Are your machines loaded w/
> 4 map tasks on each?  Can you up the number of concurrent mappers?  Can you
> get an idea of your scan rates?  Are all map tasks scanning at same rate?
>  Does one task lag the others?  Do you emit stats on each map task such as
> rows processed? Can you figure your bottleneck? Are you seeking disk all
> the time?  Anything else running while this big scan is going on?  How big
> are your cells?  Do you have one or more column families?  How many columns?
>
> For average region size, do du on the hdfs region directories and then sum
> and divide by region count.
>
> St.Ack
>
>
> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:
>
>> anyone could help? now I have about 1.1 billion nodes and it takes 2
>> hours to finish a map reduce job.
>>
>> ---------- Forwarded message ----------
>> From: Li Li <fa...@gmail.com>
>> Date: Thu, Jun 26, 2014 at 3:34 PM
>> Subject: how to do parallel scanning in map reduce using hbase as input?
>> To: user@hbase.apache.org
>>
>>
>> my table has about 700 million rows and about 80 regions. each task
>> tracker is configured with 4 mappers and 4 reducers at the same time.
>> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
>> mappers running. it takes more than an hour to finish mapper stage.
>> The hbase cluster's load is very low, about 2,000 request per second.
>> I think one mapper for a region is too small. How can I run more than
>> one mapper for a region so that it can take full advantage of
>> computing resources?
>>

Re: how to do parallel scanning in map reduce using hbase as input?

Posted by Stack <st...@duboce.net>.
How many regions now?

You still have 20 concurrent mappers running?  Are your machines loaded w/
4 map tasks on each?  Can you up the number of concurrent mappers?  Can you
get an idea of your scan rates?  Are all map tasks scanning at same rate?
 Does one task lag the others?  Do you emit stats on each map task such as
rows processed? Can you figure your bottleneck? Are you seeking disk all
the time?  Anything else running while this big scan is going on?  How big
are your cells?  Do you have one or more column families?  How many columns?

For average region size, do du on the hdfs region directories and then sum
and divide by region count.

St.Ack
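
A minimal sketch of the same measurement through the FileSystem API (the
/hbase/data/default/url_db path assumes 0.96's default layout under
hbase.rootdir, conf and the usual imports are assumed, and the listing also
contains a couple of non-region metadata directories, so treat the result as
an estimate):

    FileSystem fs = FileSystem.get(conf);
    Path tableDir = new Path("/hbase/data/default/url_db");
    long totalBytes = fs.getContentSummary(tableDir).getLength();  // pre-replication bytes
    int regionDirs = fs.listStatus(tableDir).length;               // roughly one entry per region
    System.out.println("avg region size (bytes): " + totalBytes / Math.max(1, regionDirs));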


On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fa...@gmail.com> wrote:

> anyone could help? now I have about 1.1 billion nodes and it takes 2
> hours to finish a map reduce job.
>
> ---------- Forwarded message ----------
> From: Li Li <fa...@gmail.com>
> Date: Thu, Jun 26, 2014 at 3:34 PM
> Subject: how to do parallel scanning in map reduce using hbase as input?
> To: user@hbase.apache.org
>
>
> my table has about 700 million rows and about 80 regions. each task
> tracker is configured with 4 mappers and 4 reducers at the same time.
> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> mappers running. it takes more than an hour to finish mapper stage.
> The hbase cluster's load is very low, about 2,000 request per second.
> I think one mapper for a region is too small. How can I run more than
> one mapper for a region so that it can take full advantage of
> computing resources?
>