Posted to user@hbase.apache.org by Thomas Kwan <th...@manage.com> on 2014/08/14 19:32:09 UTC

random reads

Hi there

I have a use case where I need to do a read to check whether an HBase entry
is present, and then do a put to create the entry when it is not there.

I have a script that gets a list of rowkeys from Hive and puts them in an
HDFS directory. Then I have an MR job that reads the rowkeys and does
batch reads. I am getting around 1.5K requests per second.

To attempt to make this faster, I am wondering if I can

- sort and group the rowkeys based on regions
- make the MR jobs run on regions that have the data locally

Scan or TableInputFormat must have some code that does something similar, right?

thanks
thomas
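
As an illustration of the "sort and group the rowkeys based on regions" idea, here is a minimal Python sketch. It assumes the sorted list of region start keys has already been fetched from the table by some other means; the start keys and rowkeys below are made up for the example, and group_by_region is a hypothetical helper, not an HBase API.

```python
from bisect import bisect_right
from collections import defaultdict

def group_by_region(rowkeys, region_start_keys):
    """Group rowkeys by the region whose key range contains them.

    region_start_keys must be sorted and begin with b"" (the first
    region of a table has an empty start key).
    """
    groups = defaultdict(list)
    for key in sorted(rowkeys):
        # The owning region is the one with the greatest start key <= key.
        idx = bisect_right(region_start_keys, key) - 1
        groups[region_start_keys[idx]].append(key)
    return dict(groups)

# Hypothetical start keys for a table split into three regions.
starts = [b"", b"g", b"p"]
keys = [b"zebra", b"apple", b"grape", b"pear", b"fig"]
grouped = group_by_region(keys, starts)
```

Each group could then be sent as its own batch, so that a single batch touches one region rather than being fanned out across many.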

Re: random reads

Posted by Anoop John <an...@gmail.com>.

What are your KV size and HFile block size for the table? For a random-read
type of use case, a lower value for the HFile block size might help.

-Anoop-
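
A rough way to see why block size matters for random reads: an uncached get has to read a whole HFile block to return one KV. The sketch below is back-of-the-envelope arithmetic only. The 200-byte KV size is a made-up figure, and the calculation ignores compression, bloom filters, index lookups and the block cache.

```python
DEFAULT_BLOCK_SIZE = 64 * 1024  # HBase's default HFile block size, in bytes

def read_amplification(block_size, kv_size):
    """Bytes read from an HFile per byte of useful data for one uncached get."""
    return block_size / kv_size

kv_size = 200  # hypothetical KV size in bytes
for block_size in (DEFAULT_BLOCK_SIZE, 16 * 1024, 8 * 1024):
    ratio = read_amplification(block_size, kv_size)
    print(f"block={block_size:>6} B  ->  ~{ratio:.0f}x bytes read per KV byte")
```

The trade-off is that smaller blocks mean more blocks per HFile and therefore a larger block index, so the right value depends on the actual KV size and access pattern.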

Re: random reads

Posted by Esteban Gutierrez <es...@cloudera.com>.

If not set in hbase-site.xml, both tcpnodelay and tcpkeepalive are set to
true (that's the default behavior since 0.95/0.96).

Have you noticed whether the call processing times or the call queue size
are too high? What does IO look like when you run these random gets? Are the
gets going to disk 100% of the time, or do the metrics show good
utilization of the block cache (e.g. a high hit ratio)? If you think the
region servers are looking good, maybe double-check whether any of the nodes
in the cluster has dropped its NIC speed, and make sure your client is not
the bottleneck itself. Sometimes users change the block size in the
schema for a specific CF, and that also helps.

cheers,
esteban.



--
Cloudera, Inc.



Re: random reads

Posted by Ted Yu <yu...@gmail.com>.

Thomas:
Have you set tcpnodelay to true?

See http://hbase.apache.org/book.html for explanation of
hbase.ipc.client.tcpnodelay

Cheers


Re: random reads

Posted by Thomas Kwan <th...@manage.com>.

Hi Esteban,

Thanks for sharing ideas.

We are on HBase 0.96 and Java 1.6. I have enabled short-circuit reads,
and the heap size is around 16G for each region server. We have about 20
of them.

The list of rowkeys that I need to process is about 10M. I am using
batch gets already and the batch size is ~2000 gets.

thomas
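
For scale, the figures quoted in this thread can be put together as simple arithmetic. This is only a sketch: it assumes that each of the "1.5K requests per second" is a single get rather than a whole batch, and that load is spread evenly across the region servers.

```python
import math

total_keys = 10_000_000      # "about 10M" rowkeys
batch_size = 2_000           # "~2000 gets" per batch
gets_per_second = 1_500      # assumes each "request" is one get
region_servers = 20          # "about 20" region servers

batches = math.ceil(total_keys / batch_size)
elapsed_seconds = total_keys / gets_per_second
per_server_rate = gets_per_second / region_servers

print(f"batches: {batches}")
print(f"elapsed: {elapsed_seconds / 3600:.2f} hours")
print(f"gets/s per region server: {per_server_rate:.0f}")
```

Under those assumptions the whole list takes a bit under two hours, and each region server is serving well under a hundred gets per second.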

Re: random reads

Posted by Esteban Gutierrez <es...@cloudera.com>.

Hello Thomas,

What version of HBase are you using? Sorting and grouping the rows by
region is going to help for sure. I don't think you should focus too much
on the locality side of the problem unless your HDFS input set is too large
(100s or 1000s of MBs per task); otherwise it might be faster to load the
input dataset into memory and do the batched calls. As discussed on this
mailing list recently, there are many factors that might be involved in
the performance: number of threads or tasks, size of the row, RS
resources, configurations, etc., so any additional info would be very
helpful.

cheers,
esteban.

--
Cloudera, Inc.
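
The "load the input dataset into memory and do the batched calls" approach could look roughly like the Python sketch below. It is not HBase client code: fetch_existing is a hypothetical callable standing in for a batched get, and the in-memory set stands in for the table.

```python
def batches(sorted_keys, batch_size):
    """Yield consecutive slices of at most batch_size keys."""
    for start in range(0, len(sorted_keys), batch_size):
        yield sorted_keys[start:start + batch_size]

def find_missing(rowkeys, fetch_existing, batch_size=2000):
    """Return the rowkeys that fetch_existing does not report as present.

    fetch_existing is a hypothetical callable standing in for a batched
    HBase get: it takes a list of keys and returns the subset that exists.
    """
    missing = []
    # Deduplicate and sort so each batch covers a contiguous key range.
    for batch in batches(sorted(set(rowkeys)), batch_size):
        present = set(fetch_existing(batch))
        missing.extend(k for k in batch if k not in present)
    return missing

# In-memory stand-in for the table, for illustration only.
table = {b"a", b"c"}
to_create = find_missing([b"c", b"b", b"a", b"d", b"b"],
                         lambda keys: [k for k in keys if k in table],
                         batch_size=2)
```

The keys collected in to_create are the ones that would then need a put. Sorting before batching keeps the keys in each batch close together in the key space.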
