Posted to user@hbase.apache.org by 谢良 <xi...@xiaomi.com> on 2013/04/15 13:41:14 UTC

Reply: HBase random read performance

First, setting the block size to 4KB is probably unhelpful; please refer to the comment at the beginning of HFile.java:

 * Smaller blocks are good
 * for random access, but require more memory to hold the block index, and may
 * be slower to create (because we must flush the compressor stream at the
 * conclusion of each data block, which leads to an FS I/O flush). Further, due
 * to the internal caching in Compression codec, the smallest possible block
 * size would be around 20KB-30KB.

Second, is the test client single-threaded or multi-threaded? We can't expect too much if the requests are issued one by one.

Third, could you provide more info about your DataNode disk counts and I/O utilization?
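To illustrate the second point, here is a minimal sketch of a multi-threaded multi-get client in plain Java. The hash-based server assignment and the stubbed fetch callable are stand-ins (a real client would use HConnection's region location lookup and HTable.get(List<Get>) per server), so treat this as an illustration of the batching idea, not HBase API usage:

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch only: the hash-based server assignment stands in for
// the region location lookup, and the submitted callable stands in for a
// per-server HTable.get(List<Get>) call.
public class ParallelMultiGetSketch {

  // step 1: break the row keys into per-server chunks
  // (compare processBatchCallback() in HConnectionManager)
  public static Map<Integer, List<String>> groupByServer(List<String> rowKeys,
                                                         int numServers) {
    Map<Integer, List<String>> byServer = new HashMap<>();
    for (String key : rowKeys) {
      int server = Math.abs(key.hashCode() % numServers);  // stub for getRegionLocation()
      byServer.computeIfAbsent(server, s -> new ArrayList<>()).add(key);
    }
    return byServer;
  }

  // step 2: issue one batch per server concurrently instead of one by one
  public static List<String> fetchAll(List<String> rowKeys, int numServers) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, numServers));
    List<Future<List<String>>> futures = new ArrayList<>();
    for (List<String> batch : groupByServer(rowKeys, numServers).values()) {
      futures.add(pool.submit(() -> batch));  // real code: table.get(gets) here
    }
    List<String> results = new ArrayList<>();
    try {
      for (Future<List<String>> f : futures) results.addAll(f.get());
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
    return results;
  }
}
```

With one thread per target server, the total latency is bounded by the slowest batch rather than the sum of all batches.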

Thanks,
Liang
________________________________________
From: Ankit Jain [ankitjaincs06@gmail.com]
Sent: April 15, 2013 18:53
To: user@hbase.apache.org
Subject: Re: HBase random read performance

Hi Anoop,

Thanks for the reply.

I tried setting the HFile block size to 4KB and also enabled the bloom
filter (ROW). The best read performance I was able to achieve is
10,000 records in 14 secs (record size is 1.6KB).

Please suggest some further tuning.

Thanks,
Ankit Jain



On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal <
rishabh.agrawal@impetus.co.in> wrote:

> Interesting. Can you explain why this happens?
>
> -----Original Message-----
> From: Anoop Sam John [mailto:anoopsj@huawei.com]
> Sent: Monday, April 15, 2013 3:47 PM
> To: user@hbase.apache.org
> Subject: RE: HBase random read performance
>
> Ankit
>                  I guess you might be using the default HFile block
> size, which is 64KB.
> For random gets a lower value will be better. Try something like 8KB
> and check the latency.
>
> Yes, of course blooms can help (if major compaction was not done at the
> time of testing).
>
> -Anoop-
> ________________________________________
> From: Ankit Jain [ankitjaincs06@gmail.com]
> Sent: Saturday, April 13, 2013 11:01 AM
> To: user@hbase.apache.org
> Subject: HBase random read performance
>
> Hi All,
>
> We are using HBase 0.94.5 and Hadoop 1.0.4.
>
> We have HBase cluster of 5 nodes(5 regionservers and 1 master node). Each
> regionserver has 8 GB RAM.
>
> We have loaded 25 millions records in HBase table, regions are pre-split
> into 16 regions and all the regions are equally loaded.
>
> We are getting very low random read performance while performing multi get
> from HBase.
>
> We are passing random 10000 row-keys as input, while HBase is taking around
> 17 secs to return 10000 records.
>
> Please suggest some tuning to increase HBase read performance.
>
> Thanks,
> Ankit Jain
> iLabs
>
>
>
> --
> Thanks,
> Ankit Jain
>
> ________________________________
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>



--
Thanks,
Ankit Jain

Re: Reply: HBase random read performance

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Nicolas,

I think it might be good to create a JIRA for that anyway, since it seems
that some users are expecting this behaviour.

My 2¢ ;)

JM

2013/4/16 Nicolas Liochon <nk...@gmail.com>

> I think there is something in the middle that could be done. It was
> discussed here a while ago, but without any JIRA created.  See thread:
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201302.mbox/%3CCAKxWWm19OC+dePTK60bMmcecv=7tC+3t4-bQ6FDQepPiX_EWOA@mail.gmail.com%3E
>
> If someone can spend some time on it, I can create the JIRA...
>
> Nicolas
>
>
> On Tue, Apr 16, 2013 at 9:49 AM, Liu, Raymond <ra...@intel.com>
> wrote:
>
> > So what is lacking here? The action should also been parallel inside RS
> > for each region, Instead of just parallel on RS level?
> > Seems this will be rather difficult to implement, and for Get, might not
> > be worthy?
> >
> > >
> > > I looked
> > > at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
> > > in
> > > 0.94
> > >
> > > In processBatchCallback(), starting line 1538,
> > >
> > >         // step 1: break up into regionserver-sized chunks and build
> the
> > data
> > > structs
> > >         Map<HRegionLocation, MultiAction<R>> actionsByServer =
> > >           new HashMap<HRegionLocation, MultiAction<R>>();
> > >         for (int i = 0; i < workingList.size(); i++) {
> > >
> > > So we do group individual action by server.
> > >
> > > FYI
> > >
> > > On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Doug made a good point.
> > > >
> > > > Take a look at the performance gain for parallel scan (bottom chart
> > > > compared to top chart):
> > > >
> https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
> > > >
> > > > See
> > > >
> > > https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=1362
> > >
> 8300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpan
> > > el#comment-13628300for explanation of the two methods.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil
> > > <do...@explorysmedical.com>wrote:
> > > >
> > > >>
> > > >> Hi there, regarding this...
> > > >>
> > > >> > We are passing random 10000 row-keys as input, while HBase is
> > > >> > taking
> > > >> around
> > > >> > 17 secs to return 10000 records.
> > > >>
> > > >>
> > > >> ….  Given that you are generating 10,000 random keys, your multi-get
> > > >> is very likely hitting all 5 nodes of your cluster.
> > > >>
> > > >>
> > > >> Historically, multi-Get used to first sort the requests by RS and
> > > >> then
> > > >> *serially* go the RS to process the multi-Get.  I'm not sure of the
> > > >> current (0.94.x) behavior if it multi-threads or not.
> > > >>
> > > >> One thing you might want to consider is confirming that client
> > > >> behavior, and if it's not multi-threading then perform a test that
> > > >> does the same RS sorting via...
> > > >>
> > > >>
> > > >>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable
> > > >> .html#
> > > >> getRegionLocation%28byte[<
> http://hbase.apache.org/apidocs/org/apache/
> > > >> hadoop/hbase/client/HTable.html#getRegionLocation%28byte[>
> > > >> ]%29
> > > >>
> > > >> …. and then spin up your own threads (one per target RS) and see
> what
> > > >> happens.
> > > >>
> > > >>
> > > >>
> > > >> On 4/15/13 9:04 AM, "Ankit Jain" <an...@gmail.com> wrote:
> > > >>
> > > >> >Hi Liang,
> > > >> >
> > > >> >Thanks Liang for reply..
> > > >> >
> > > >> >Ans1:
> > > >> >I tried by using HFile block size of 32 KB and bloom filter is
> > enabled.
> > > >> >The
> > > >> >random read performance is 10000 records in 23 secs.
> > > >> >
> > > >> >Ans2:
> > > >> >We are retrieving all the 10000 rows in one call.
> > > >> >
> > > >> >Ans3:
> > > >> >Disk detail:
> > > >> >Model Number:       ST2000DM001-1CH164
> > > >> >Serial Number:      Z1E276YF
> > > >> >
> > > >> >Please suggest some more optimization
> > > >> >
> > > >> >Thanks,
> > > >> >Ankit Jain
> > > >> >
> > > >> >On Mon, Apr 15, 2013 at 5:11 PM, 谢良 <xi...@xiaomi.com> wrote:
> > > >> >
> > > >> >> First, it's probably helpless to set block size to 4KB, please
> > > >> >> refer to the beginning of HFile.java:
> > > >> >>
> > > >> >>  Smaller blocks are good
> > > >> >>  * for random access, but require more memory to hold the block
> > > >> >>index, and  may
> > > >> >>  * be slower to create (because we must flush the compressor
> > > >> >>stream at the
> > > >> >>  * conclusion of each data block, which leads to an FS I/O
> flush).
> > > >> >> Further, due
> > > >> >>  * to the internal caching in Compression codec, the smallest
> > > >> >>possible  block
> > > >> >>  * size would be around 20KB-30KB.
> > > >> >>
> > > >> >> Second, is it a single-thread test client or multi-threads? we
> > > >> >> couldn't expect too much if the requests are one by one.
> > > >> >>
> > > >> >> Third, could you provide more info about  your DN disk numbers
> and
> > > >> >> IO utils ?
> > > >> >>
> > > >> >> Thanks,
> > > >> >> Liang
> > > >> >> ________________________________________
> > > >> >> From: Ankit Jain [ankitjaincs06@gmail.com]
> > > >> >> Sent: April 15, 2013 18:53
> > > >> >> To: user@hbase.apache.org
> > > >> >> Subject: Re: HBase random read performance
> > > >> >>
> > > >> >> Hi Anoop,
> > > >> >>
> > > >> >> Thanks for reply..
> > > >> >>
> > > >> >> I tried by setting Hfile block size 4KB and also enabled the
> bloom
> > > >> >> filter(ROW). The maximum read performance that I was able to
> > > >> >> achieve is
> > > >> >> 10000 records in 14 secs (size of record is 1.6KB).
> > > >> >>
> > > >> >> Please suggest some tuning..
> > > >> >>
> > > >> >> Thanks,
> > > >> >> Ankit Jain
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal <
> > > >> >> rishabh.agrawal@impetus.co.in> wrote:
> > > >> >>
> > > >> >> > Interesting. Can you explain why this happens?
> > > >> >> >
> > > >> >> > -----Original Message-----
> > > >> >> > From: Anoop Sam John [mailto:anoopsj@huawei.com]
> > > >> >> > Sent: Monday, April 15, 2013 3:47 PM
> > > >> >> > To: user@hbase.apache.org
> > > >> >> > Subject: RE: HBase random read performance
> > > >> >> >
> > > >> >> > Ankit
> > > >> >> >                  I guess you might be having default HFile
> block
> > > >> >> > size which is 64KB.
> > > >> >> > For random gets a lower value will be better. Try will some
> > > >> >> > thing
> > > >> like
> > > >> >> 8KB
> > > >> >> > and check the latency?
> > > >> >> >
> > > >> >> > Ya ofcourse blooms can help (if major compaction was not done
> at
> > > >> >> > the
> > > >> >>time
> > > >> >> > of testing)
> > > >> >> >
> > > >> >> > -Anoop-
> > > >> >> > ________________________________________
> > > >> >> > From: Ankit Jain [ankitjaincs06@gmail.com]
> > > >> >> > Sent: Saturday, April 13, 2013 11:01 AM
> > > >> >> > To: user@hbase.apache.org
> > > >> >> > Subject: HBase random read performance
> > > >> >> >
> > > >> >> > Hi All,
> > > >> >> >
> > > >> >> > We are using HBase 0.94.5 and Hadoop 1.0.4.
> > > >> >> >
> > > >> >> > We have HBase cluster of 5 nodes(5 regionservers and 1 master
> > node).
> > > >> >>Each
> > > >> >> > regionserver has 8 GB RAM.
> > > >> >> >
> > > >> >> > We have loaded 25 millions records in HBase table, regions are
> > > >> >>pre-split
> > > >> >> > into 16 regions and all the regions are equally loaded.
> > > >> >> >
> > > >> >> > We are getting very low random read performance while
> performing
> > > >> multi
> > > >> >> get
> > > >> >> > from HBase.
> > > >> >> >
> > > >> >> > We are passing random 10000 row-keys as input, while HBase is
> > > >> >> > taking
> > > >> >> around
> > > >> >> > 17 secs to return 10000 records.
> > > >> >> >
> > > >> >> > Please suggest some tuning to increase HBase read performance.
> > > >> >> >
> > > >> >> > Thanks,
> > > >> >> > Ankit Jain
> > > >> >> > iLabs
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> > --
> > > >> >> > Thanks,
> > > >> >> > Ankit Jain
> > > >> >> >
> > > >> >> > ________________________________
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Thanks,
> > > >> >> Ankit Jain
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> >--
> > > >> >Thanks,
> > > >> >Ankit Jain
> > > >>
> > > >>
> > > >
> >
>

Re: Reply: HBase random read performance

Posted by Nicolas Liochon <nk...@gmail.com>.
I think there is something in the middle that could be done. It was
discussed here a while ago, but without any JIRA created.  See thread:
http://mail-archives.apache.org/mod_mbox/hbase-user/201302.mbox/%3CCAKxWWm19OC+dePTK60bMmcecv=7tC+3t4-bQ6FDQepPiX_EWOA@mail.gmail.com%3E

If someone can spend some time on it, I can create the JIRA...

Nicolas


On Tue, Apr 16, 2013 at 9:49 AM, Liu, Raymond <ra...@intel.com> wrote:

> So what is lacking here? The action should also been parallel inside RS
> for each region, Instead of just parallel on RS level?
> Seems this will be rather difficult to implement, and for Get, might not
> be worthy?
>
> >
> > I looked
> > at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
> > in
> > 0.94
> >
> > In processBatchCallback(), starting line 1538,
> >
> >         // step 1: break up into regionserver-sized chunks and build the
> data
> > structs
> >         Map<HRegionLocation, MultiAction<R>> actionsByServer =
> >           new HashMap<HRegionLocation, MultiAction<R>>();
> >         for (int i = 0; i < workingList.size(); i++) {
> >
> > So we do group individual action by server.
> >
> > FYI
> >
> > On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Doug made a good point.
> > >
> > > Take a look at the performance gain for parallel scan (bottom chart
> > > compared to top chart):
> > > https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
> > >
> > > See
> > >
> > https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=1362
> > 8300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpan
> > el#comment-13628300for explanation of the two methods.
> > >
> > > Cheers
> > >
> > > On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil
> > <do...@explorysmedical.com>wrote:
> > >
> > >>
> > >> Hi there, regarding this...
> > >>
> > >> > We are passing random 10000 row-keys as input, while HBase is
> > >> > taking
> > >> around
> > >> > 17 secs to return 10000 records.
> > >>
> > >>
> > >> ….  Given that you are generating 10,000 random keys, your multi-get
> > >> is very likely hitting all 5 nodes of your cluster.
> > >>
> > >>
> > >> Historically, multi-Get used to first sort the requests by RS and
> > >> then
> > >> *serially* go the RS to process the multi-Get.  I'm not sure of the
> > >> current (0.94.x) behavior if it multi-threads or not.
> > >>
> > >> One thing you might want to consider is confirming that client
> > >> behavior, and if it's not multi-threading then perform a test that
> > >> does the same RS sorting via...
> > >>
> > >>
> > >> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable
> > >> .html#
> > >> getRegionLocation%28byte[<http://hbase.apache.org/apidocs/org/apache/
> > >> hadoop/hbase/client/HTable.html#getRegionLocation%28byte[>
> > >> ]%29
> > >>
> > >> …. and then spin up your own threads (one per target RS) and see what
> > >> happens.
> > >>
> > >>
> > >>
> > >> On 4/15/13 9:04 AM, "Ankit Jain" <an...@gmail.com> wrote:
> > >>
> > >> >Hi Liang,
> > >> >
> > >> >Thanks Liang for reply..
> > >> >
> > >> >Ans1:
> > >> >I tried by using HFile block size of 32 KB and bloom filter is
> enabled.
> > >> >The
> > >> >random read performance is 10000 records in 23 secs.
> > >> >
> > >> >Ans2:
> > >> >We are retrieving all the 10000 rows in one call.
> > >> >
> > >> >Ans3:
> > >> >Disk detai:
> > >> >Model Number:       ST2000DM001-1CH164
> > >> >Serial Number:      Z1E276YF
> > >> >
> > >> >Please suggest some more optimization
> > >> >
> > >> >Thanks,
> > >> >Ankit Jain
> > >> >
> > >> >On Mon, Apr 15, 2013 at 5:11 PM, 谢良 <xi...@xiaomi.com> wrote:
> > >> >
> > >> >> First, it's probably helpless to set block size to 4KB, please
> > >> >> refer to the beginning of HFile.java:
> > >> >>
> > >> >>  Smaller blocks are good
> > >> >>  * for random access, but require more memory to hold the block
> > >> >>index, and  may
> > >> >>  * be slower to create (because we must flush the compressor
> > >> >>stream at the
> > >> >>  * conclusion of each data block, which leads to an FS I/O flush).
> > >> >> Further, due
> > >> >>  * to the internal caching in Compression codec, the smallest
> > >> >>possible  block
> > >> >>  * size would be around 20KB-30KB.
> > >> >>
> > >> >> Second, is it a single-thread test client or multi-threads? we
> > >> >> couldn't expect too much if the requests are one by one.
> > >> >>
> > >> >> Third, could you provide more info about  your DN disk numbers and
> > >> >> IO utils ?
> > >> >>
> > >> >> Thanks,
> > >> >> Liang
> > >> >> ________________________________________
> > >> >> From: Ankit Jain [ankitjaincs06@gmail.com]
> > >> >> Sent: April 15, 2013 18:53
> > >> >> To: user@hbase.apache.org
> > >> >> Subject: Re: HBase random read performance
> > >> >>
> > >> >> Hi Anoop,
> > >> >>
> > >> >> Thanks for reply..
> > >> >>
> > >> >> I tried by setting Hfile block size 4KB and also enabled the bloom
> > >> >> filter(ROW). The maximum read performance that I was able to
> > >> >> achieve is
> > >> >> 10000 records in 14 secs (size of record is 1.6KB).
> > >> >>
> > >> >> Please suggest some tuning..
> > >> >>
> > >> >> Thanks,
> > >> >> Ankit Jain
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal <
> > >> >> rishabh.agrawal@impetus.co.in> wrote:
> > >> >>
> > >> >> > Interesting. Can you explain why this happens?
> > >> >> >
> > >> >> > -----Original Message-----
> > >> >> > From: Anoop Sam John [mailto:anoopsj@huawei.com]
> > >> >> > Sent: Monday, April 15, 2013 3:47 PM
> > >> >> > To: user@hbase.apache.org
> > >> >> > Subject: RE: HBase random read performance
> > >> >> >
> > >> >> > Ankit
> > >> >> >                  I guess you might be having default HFile block
> > >> >> > size which is 64KB.
> > >> >> > For random gets a lower value will be better. Try will some
> > >> >> > thing
> > >> like
> > >> >> 8KB
> > >> >> > and check the latency?
> > >> >> >
> > >> >> > Ya ofcourse blooms can help (if major compaction was not done at
> > >> >> > the
> > >> >>time
> > >> >> > of testing)
> > >> >> >
> > >> >> > -Anoop-
> > >> >> > ________________________________________
> > >> >> > From: Ankit Jain [ankitjaincs06@gmail.com]
> > >> >> > Sent: Saturday, April 13, 2013 11:01 AM
> > >> >> > To: user@hbase.apache.org
> > >> >> > Subject: HBase random read performance
> > >> >> >
> > >> >> > Hi All,
> > >> >> >
> > >> >> > We are using HBase 0.94.5 and Hadoop 1.0.4.
> > >> >> >
> > >> >> > We have HBase cluster of 5 nodes(5 regionservers and 1 master
> node).
> > >> >>Each
> > >> >> > regionserver has 8 GB RAM.
> > >> >> >
> > >> >> > We have loaded 25 millions records in HBase table, regions are
> > >> >>pre-split
> > >> >> > into 16 regions and all the regions are equally loaded.
> > >> >> >
> > >> >> > We are getting very low random read performance while performing
> > >> multi
> > >> >> get
> > >> >> > from HBase.
> > >> >> >
> > >> >> > We are passing random 10000 row-keys as input, while HBase is
> > >> >> > taking
> > >> >> around
> > >> >> > 17 secs to return 10000 records.
> > >> >> >
> > >> >> > Please suggest some tuning to increase HBase read performance.
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Ankit Jain
> > >> >> > iLabs
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > --
> > >> >> > Thanks,
> > >> >> > Ankit Jain
> > >> >> >
> > >> >> > ________________________________
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Thanks,
> > >> >> Ankit Jain
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> >--
> > >> >Thanks,
> > >> >Ankit Jain
> > >>
> > >>
> > >
>

RE: Reply: HBase random read performance

Posted by "Liu, Raymond" <ra...@intel.com>.
So what is lacking here? Should the actions also be parallelized inside the RS for each region, instead of just in parallel at the RS level?
That seems rather difficult to implement, and for Get it might not be worth it?

> 
> I looked
> at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
> in
> 0.94
> 
> In processBatchCallback(), starting line 1538,
> 
>         // step 1: break up into regionserver-sized chunks and build the data
> structs
>         Map<HRegionLocation, MultiAction<R>> actionsByServer =
>           new HashMap<HRegionLocation, MultiAction<R>>();
>         for (int i = 0; i < workingList.size(); i++) {
> 
> So we do group individual action by server.
> 
> FYI
> 
> On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu <yu...@gmail.com> wrote:
> 
> > Doug made a good point.
> >
> > Take a look at the performance gain for parallel scan (bottom chart
> > compared to top chart):
> > https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
> >
> > See
> >
> https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=1362
> 8300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpan
> el#comment-13628300for explanation of the two methods.
> >
> > Cheers
> >
> > On Mon, Apr 15, 2013 at 6:21 AM, Doug Meil
> <do...@explorysmedical.com>wrote:
> >
> >>
> >> Hi there, regarding this...
> >>
> >> > We are passing random 10000 row-keys as input, while HBase is
> >> > taking
> >> around
> >> > 17 secs to return 10000 records.
> >>
> >>
> >> ….  Given that you are generating 10,000 random keys, your multi-get
> >> is very likely hitting all 5 nodes of your cluster.
> >>
> >>
> >> Historically, multi-Get used to first sort the requests by RS and
> >> then
> >> *serially* go the RS to process the multi-Get.  I'm not sure of the
> >> current (0.94.x) behavior if it multi-threads or not.
> >>
> >> One thing you might want to consider is confirming that client
> >> behavior, and if it's not multi-threading then perform a test that
> >> does the same RS sorting via...
> >>
> >>
> >> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable
> >> .html#
> >> getRegionLocation%28byte[<http://hbase.apache.org/apidocs/org/apache/
> >> hadoop/hbase/client/HTable.html#getRegionLocation%28byte[>
> >> ]%29
> >>
> >> …. and then spin up your own threads (one per target RS) and see what
> >> happens.
> >>
> >>
> >>
> >> On 4/15/13 9:04 AM, "Ankit Jain" <an...@gmail.com> wrote:
> >>
> >> >Hi Liang,
> >> >
> >> >Thanks Liang for reply..
> >> >
> >> >Ans1:
> >> >I tried by using HFile block size of 32 KB and bloom filter is enabled.
> >> >The
> >> >random read performance is 10000 records in 23 secs.
> >> >
> >> >Ans2:
> >> >We are retrieving all the 10000 rows in one call.
> >> >
> >> >Ans3:
> >> >Disk detai:
> >> >Model Number:       ST2000DM001-1CH164
> >> >Serial Number:      Z1E276YF
> >> >
> >> >Please suggest some more optimization
> >> >
> >> >Thanks,
> >> >Ankit Jain
> >> >
> >> >On Mon, Apr 15, 2013 at 5:11 PM, 谢良 <xi...@xiaomi.com> wrote:
> >> >
> >> >> First, it's probably helpless to set block size to 4KB, please
> >> >> refer to the beginning of HFile.java:
> >> >>
> >> >>  Smaller blocks are good
> >> >>  * for random access, but require more memory to hold the block
> >> >>index, and  may
> >> >>  * be slower to create (because we must flush the compressor
> >> >>stream at the
> >> >>  * conclusion of each data block, which leads to an FS I/O flush).
> >> >> Further, due
> >> >>  * to the internal caching in Compression codec, the smallest
> >> >>possible  block
> >> >>  * size would be around 20KB-30KB.
> >> >>
> >> >> Second, is it a single-thread test client or multi-threads? we
> >> >> couldn't expect too much if the requests are one by one.
> >> >>
> >> >> Third, could you provide more info about  your DN disk numbers and
> >> >> IO utils ?
> >> >>
> >> >> Thanks,
> >> >> Liang
> >> >> ________________________________________
> >> >> From: Ankit Jain [ankitjaincs06@gmail.com]
> >> >> Sent: April 15, 2013 18:53
> >> >> To: user@hbase.apache.org
> >> >> Subject: Re: HBase random read performance
> >> >>
> >> >> Hi Anoop,
> >> >>
> >> >> Thanks for reply..
> >> >>
> >> >> I tried by setting Hfile block size 4KB and also enabled the bloom
> >> >> filter(ROW). The maximum read performance that I was able to
> >> >> achieve is
> >> >> 10000 records in 14 secs (size of record is 1.6KB).
> >> >>
> >> >> Please suggest some tuning..
> >> >>
> >> >> Thanks,
> >> >> Ankit Jain
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal <
> >> >> rishabh.agrawal@impetus.co.in> wrote:
> >> >>
> >> >> > Interesting. Can you explain why this happens?
> >> >> >
> >> >> > -----Original Message-----
> >> >> > From: Anoop Sam John [mailto:anoopsj@huawei.com]
> >> >> > Sent: Monday, April 15, 2013 3:47 PM
> >> >> > To: user@hbase.apache.org
> >> >> > Subject: RE: HBase random read performance
> >> >> >
> >> >> > Ankit
> >> >> >                  I guess you might be having default HFile block
> >> >> > size which is 64KB.
> >> >> > For random gets a lower value will be better. Try will some
> >> >> > thing
> >> like
> >> >> 8KB
> >> >> > and check the latency?
> >> >> >
> >> >> > Ya ofcourse blooms can help (if major compaction was not done at
> >> >> > the
> >> >>time
> >> >> > of testing)
> >> >> >
> >> >> > -Anoop-
> >> >> > ________________________________________
> >> >> > From: Ankit Jain [ankitjaincs06@gmail.com]
> >> >> > Sent: Saturday, April 13, 2013 11:01 AM
> >> >> > To: user@hbase.apache.org
> >> >> > Subject: HBase random read performance
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > We are using HBase 0.94.5 and Hadoop 1.0.4.
> >> >> >
> >> >> > We have HBase cluster of 5 nodes(5 regionservers and 1 master node).
> >> >>Each
> >> >> > regionserver has 8 GB RAM.
> >> >> >
> >> >> > We have loaded 25 millions records in HBase table, regions are
> >> >>pre-split
> >> >> > into 16 regions and all the regions are equally loaded.
> >> >> >
> >> >> > We are getting very low random read performance while performing
> >> multi
> >> >> get
> >> >> > from HBase.
> >> >> >
> >> >> > We are passing random 10000 row-keys as input, while HBase is
> >> >> > taking
> >> >> around
> >> >> > 17 secs to return 10000 records.
> >> >> >
> >> >> > Please suggest some tuning to increase HBase read performance.
> >> >> >
> >> >> > Thanks,
> >> >> > Ankit Jain
> >> >> > iLabs
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Thanks,
> >> >> > Ankit Jain
> >> >> >
> >> >> > ________________________________
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Thanks,
> >> >> Ankit Jain
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >Thanks,
> >> >Ankit Jain
> >>
> >>
> >

Re: Reply: HBase random read performance

Posted by lars hofhansl <la...@apache.org>.
This is fundamentally different, though. A scanner by default scans all regions serially, because it promises to return all rows in sort order.
A multi get is already parallelized across regions (and hence across region servers).
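For reference, the client-side "parallel scan" idea (one scanner per region run concurrently, then re-sorted) can be sketched in plain Java. The in-memory TreeMap and the region start keys are hypothetical stand-ins; a real client would open one ResultScanner per region via Scan.setStartRow/setStopRow:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a client-side parallel scan: one "scanner" per region runs
// concurrently, and results are re-sorted at the end because regions
// finish in arbitrary order. The TreeMap stands in for a table.
public class ParallelScanSketch {

  public static List<String> parallelScan(NavigableMap<String, String> table,
                                          List<String> regionStarts) {
    ExecutorService pool = Executors.newFixedThreadPool(regionStarts.size());
    List<Future<List<String>>> futures = new ArrayList<>();
    for (int i = 0; i < regionStarts.size(); i++) {
      String start = regionStarts.get(i);
      String stop = (i + 1 < regionStarts.size()) ? regionStarts.get(i + 1) : null;
      futures.add(pool.submit(() -> {
        // per-region scan over [start, stop)
        Map<String, String> slice = (stop == null)
            ? table.tailMap(start, true)
            : table.subMap(start, true, stop, false);
        return new ArrayList<>(slice.keySet());
      }));
    }
    List<String> rows = new ArrayList<>();
    try {
      for (Future<List<String>> f : futures) rows.addAll(f.get());
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
    Collections.sort(rows);  // restore the global sort order a Scan promises
    return rows;
  }
}
```

The final sort is exactly the cost a serial scanner avoids by visiting regions in key order, which is why parallelizing it is a behavioral trade-off rather than a free win.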


Before we do a lot of work here, we should first make sure that nothing else is wrong with the OP's setup.
17 secs for 10,000 gets is not right.


Ankit, what does the IO look like across the machines in the cluster while this is happening?

Since you pick 10,000 rows at random, is your expectation that the entire set of rows will fit into the block cache? Is that the case?
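To put rough numbers on that question (row count and record size taken from earlier in the thread; the 25% heap fraction for the block cache is an assumption, matching the usual hfile.block.cache.size default):

```java
// Back-of-the-envelope sizing for the setup described in this thread:
// 25 million rows of ~1.6KB vs. 5 region servers with 8GB heap each.
public class CacheSizing {

  public static double datasetGb(long rows, double recordKb) {
    return rows * recordKb / (1024.0 * 1024.0);   // KB -> GB
  }

  public static double blockCacheGb(int servers, double heapGbPerServer,
                                    double cacheFraction) {
    return servers * heapGbPerServer * cacheFraction;
  }

  public static void main(String[] args) {
    double data  = datasetGb(25_000_000L, 1.6);    // ~38 GB of raw records
    double cache = blockCacheGb(5, 8.0, 0.25);     // ~10 GB of aggregate block cache
    System.out.printf("dataset ~%.1f GB, block cache ~%.1f GB%n", data, cache);
    // The table is roughly 4x larger than the aggregate cache, so truly
    // random gets will mostly miss the cache and be bound by disk seeks.
  }
}
```

Under those assumptions the working set cannot fit in cache, which would make the observed latency disk-bound rather than an HBase tuning problem.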

-- Lars



________________________________
 From: Ted Yu <yu...@gmail.com>
To: user@hbase.apache.org 
Sent: Monday, April 15, 2013 10:03 AM
Subject: Re: Reply: HBase random read performance
 

This is a related JIRA which should provide a noticeable speed-up:

HBASE-1935 Scan in parallel

Cheers

On Mon, Apr 15, 2013 at 7:13 AM, Ted Yu <yu...@gmail.com> wrote:

> I looked
> at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java in
> 0.94
>
> In processBatchCallback(), starting line 1538,
>
>         // step 1: break up into regionserver-sized chunks and build the
> data structs
>         Map<HRegionLocation, MultiAction<R>> actionsByServer =
>           new HashMap<HRegionLocation, MultiAction<R>>();
>         for (int i = 0; i < workingList.size(); i++) {
>
> So we do group individual action by server.
>
> FYI
>
> On Mon, Apr 15, 2013 at 6:30 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> Doug made a good point.
>>
>> Take a look at the performance gain for parallel scan (bottom chart
>> compared to top chart):
>> https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png
>>
>> See
>> https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300for explanation of the two methods.
>>
>> Cheers

Re: 答复: HBase random read performance

Posted by Ted Yu <yu...@gmail.com>.
This is a related JIRA which should provide a noticeable speedup:

HBASE-1935 Scan in parallel

Cheers


Re: 答复: HBase random read performance

Posted by Ted Yu <yu...@gmail.com>.
I looked
at src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java in
0.94

In processBatchCallback(), starting line 1538,

        // step 1: break up into regionserver-sized chunks and build the
data structs
        Map<HRegionLocation, MultiAction<R>> actionsByServer =
          new HashMap<HRegionLocation, MultiAction<R>>();
        for (int i = 0; i < workingList.size(); i++) {

So we do group individual actions by server.

FYI


Re: 答复: HBase random read performance

Posted by Ted Yu <yu...@gmail.com>.
Doug made a good point.

Take a look at the performance gain for parallel scan (bottom chart
compared to top chart):
https://issues.apache.org/jira/secure/attachment/12578083/FDencode.png

See
https://issues.apache.org/jira/browse/HBASE-8316?focusedCommentId=13628300&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628300
for explanation of the two methods.

Cheers


Re: 答复: HBase random read performance

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there, regarding this...

> We are passing random 10000 row-keys as input, while HBase is taking
> around
> 17 secs to return 10000 records.


….  Given that you are generating 10,000 random keys, your multi-get is
very likely hitting all 5 nodes of your cluster.


Historically, multi-Get used to first sort the requests by RS and then
go to each RS *serially* to process the multi-Get.  I'm not sure whether the
current (0.94.x) client multi-threads or not.

One thing you might want to consider is confirming that client behavior,
and if it isn't multi-threaded, performing a test that does the same RS
sorting via...

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29

…. and then spin up your own threads (one per target RS) and see what
happens.
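
Doug's suggestion — sort the keys by hosting region server, then issue one batch per server in parallel — can be sketched with plain JDK classes. Here `locateServer` is a hypothetical stand-in for `HTable.getRegionLocation(...)`, and each submitted task stands in for one multi-get against a single region server; a real test would substitute the actual HBase calls.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelMultiGet {

    // Hypothetical stand-in for HTable.getRegionLocation(rowKey);
    // a real client would ask the HBase API which RS hosts the key.
    static String locateServer(String rowKey) {
        return "rs-" + Math.floorMod(rowKey.hashCode(), 5); // pretend 5 region servers
    }

    // Step 1: break the request up into per-regionserver chunks.
    static Map<String, List<String>> groupByServer(List<String> rowKeys) {
        Map<String, List<String>> byServer = new HashMap<>();
        for (String key : rowKeys) {
            byServer.computeIfAbsent(locateServer(key), s -> new ArrayList<>()).add(key);
        }
        return byServer;
    }

    public static void main(String[] args) throws Exception {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10000; i++) keys.add("row-" + i);

        Map<String, List<String>> byServer = groupByServer(keys);

        // Step 2: one thread per target RS, instead of visiting them serially.
        ExecutorService pool = Executors.newFixedThreadPool(byServer.size());
        List<Future<Integer>> batches = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : byServer.entrySet()) {
            // In a real test this task would run one multi-get for entry.getValue().
            batches.add(pool.submit(() -> entry.getValue().size()));
        }
        int fetched = 0;
        for (Future<Integer> f : batches) fetched += f.get();
        pool.shutdown();
        System.out.println(fetched); // 10000: every key handled exactly once
    }
}
```

Whether this helps depends on whether the client already parallelizes per-server batches; timing both approaches against the same 10000 keys would settle it.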





Re: 答复: HBase random read performance

Posted by Ankit Jain <an...@gmail.com>.
Hi Liang,

Thanks Liang for reply..

Ans1:
I tried using an HFile block size of 32 KB with the bloom filter enabled. The
random read performance is 10000 records in 23 secs.

Ans2:
We are retrieving all the 10000 rows in one call.

Ans3:
Disk detail:
Model Number:       ST2000DM001-1CH164
Serial Number:      Z1E276YF

Please suggest some more optimization

Thanks,
Ankit Jain
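
For reference, the block-size and bloom-filter changes discussed in this thread can be applied from the 0.94-era HBase shell; the table and column-family names below are illustrative, not taken from the thread:

```
disable 'mytable'
alter 'mytable', {NAME => 'cf', BLOCKSIZE => '8192', BLOOMFILTER => 'ROW'}
enable 'mytable'
major_compact 'mytable'   # existing HFiles keep the old block size until rewritten
```

Note that the major compaction also collapses each region to a single HFile, which (per Anoop's caveat) reduces what the bloom filter can show in a benchmark.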

On Mon, Apr 15, 2013 at 5:11 PM, 谢良 <xi...@xiaomi.com> wrote:

> First, it probably won't help to set the block size to 4KB; please refer to
> the beginning of HFile.java:
>
>  Smaller blocks are good
>  * for random access, but require more memory to hold the block index, and
> may
>  * be slower to create (because we must flush the compressor stream at the
>  * conclusion of each data block, which leads to an FS I/O flush).
> Further, due
>  * to the internal caching in Compression codec, the smallest possible
> block
>  * size would be around 20KB-30KB.
>
> Second, is it a single-threaded or a multi-threaded test client? We can't
> expect too much if the requests are issued one by one.
>
> Third, could you provide more info about your DN disk counts and IO
> utilization?
>
> Thanks,
> Liang
> ________________________________________
> From: Ankit Jain [ankitjaincs06@gmail.com]
> Sent: April 15, 2013 18:53
> To: user@hbase.apache.org
> Subject: Re: HBase random read performance
>
> Hi Anoop,
>
> Thanks for reply..
>
> I tried setting the HFile block size to 4KB and also enabled the bloom
> filter(ROW). The maximum read performance that I was able to achieve is
> 10000 records in 14 secs (size of record is 1.6KB).
>
> Please suggest some tuning..
>
> Thanks,
> Ankit Jain
>
>
>
> On Mon, Apr 15, 2013 at 4:12 PM, Rishabh Agrawal <
> rishabh.agrawal@impetus.co.in> wrote:
>
> > Interesting. Can you explain why this happens?
> >
> > -----Original Message-----
> > From: Anoop Sam John [mailto:anoopsj@huawei.com]
> > Sent: Monday, April 15, 2013 3:47 PM
> > To: user@hbase.apache.org
> > Subject: RE: HBase random read performance
> >
> > Ankit
> > I guess you might be having the default HFile block size,
> > which is 64KB.
> > For random gets a lower value will be better. Try with something like 8KB
> > and check the latency.
> >
> > Yes, of course blooms can help (if major compaction was not done at the time
> > of testing)
> >
> > -Anoop-
> > ________________________________________
> > From: Ankit Jain [ankitjaincs06@gmail.com]
> > Sent: Saturday, April 13, 2013 11:01 AM
> > To: user@hbase.apache.org
> > Subject: HBase random read performance
> >
> > Hi All,
> >
> > We are using HBase 0.94.5 and Hadoop 1.0.4.
> >
> > We have an HBase cluster of 5 nodes (5 regionservers and 1 master node). Each
> > regionserver has 8 GB RAM.
> >
> > We have loaded 25 million records into an HBase table; regions are pre-split
> > into 16 regions and all the regions are equally loaded.
> >
> > We are getting very low random read performance while performing multi-get
> > from HBase.
> >
> > We are passing 10000 random row-keys as input, while HBase is taking around
> > 17 secs to return 10000 records.
> >
> > Please suggest some tuning to increase HBase read performance.
> >
> > Thanks,
> > Ankit Jain
> > iLabs
> >
> >
> >
> > --
> > Thanks,
> > Ankit Jain
> >
> >
>
>
>
> --
> Thanks,
> Ankit Jain
>



-- 
Thanks,
Ankit Jain