Posted to user@hbase.apache.org by Henning Blohm <he...@zfabrik.de> on 2016/01/25 19:29:32 UTC

parallel scanning?

Hi,

I am looking for advice on an HBase mass data access optimization problem.

In our application all data records stored in HBase have a time 
dimension (as an inverted timestamp) and a GUID in the row key. Retrieving 
a record requires issuing a scan with the GUID as prefix.
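
Such a key layout can be sketched as follows (a minimal illustration, assuming the GUID bytes come first followed by Long.MAX_VALUE minus the timestamp; the class and method names are illustrative, not the actual schema):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowKeys {
    // Row key = <GUID bytes><Long.MAX_VALUE - timestamp>, so that for a given
    // GUID a prefix scan returns the newest record first.
    static byte[] rowKey(String guid, long timestampMillis) {
        byte[] guidBytes = guid.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(guidBytes.length + Long.BYTES)
                .put(guidBytes)
                .putLong(Long.MAX_VALUE - timestampMillis) // inverted time
                .array();
    }

    public static void main(String[] args) {
        byte[] older = rowKey("guid-42", 1_000L);
        byte[] newer = rowKey("guid-42", 2_000L);
        // HBase sorts row keys as unsigned byte arrays.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0); // true
    }
}
```

Because the timestamp is inverted, the newer record sorts first under the GUID prefix, which is what makes the prefix scan return the latest version up front.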

In order to get to an entry (there are various access paths) we use a simple 
secondary index that also has a time dimension in its row key and so needs 
a scan as well.

For mass updates I am currently seeking ways to improve lookup performance.

I found various discussions and issues on multi-scans (as in multi-Get, 
multi-Delete) but none of it was really helpful in sorting out the most 
promising direction.

Currently I am experimenting with simply parallelizing lookups in chunks 
from the client. That reduces elapsed wait time a bit. It seems, though, 
that avoiding round trips altogether by "scanning in parallel 
server-side" should yield much better improvements.

Is there anything like that already available that I should look into?

Thanks,
Henning

-- 
Henning Blohm

*ZFabrik Software GmbH & Co. KG*

T: 	+49 6227 3984255
F: 	+49 6227 3984254
M: 	+49 1781891820

Lammstrasse 2 69190 Walldorf

henning.blohm@zfabrik.de <ma...@zfabrik.de>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>


Re: parallel scanning?

Posted by Stack <st...@duboce.net>.
On Mon, Jan 25, 2016 at 10:29 AM, Henning Blohm <he...@zfabrik.de>
wrote:

> Hi,
>
> I am looking for advice on an HBase mass data access optimization problem.
>
> In our application all data records stored in HBase have a time dimension
> (as an inverted timestamp) and a GUID in the row key. Retrieving a record
> requires issuing a scan with the GUID as prefix.
>
>
So GUID precedes the inverted timestamp?



> In order to get to an entry (there are various access paths) we use a simple
> secondary index that also has a time dimension in its row key and so needs a
> scan as well.
>
> For mass updates I am currently seeking ways to improve lookup performance.
>
> I found various discussions and issues on multi-scans (as in multi-Get,
> multi-Delete) but none of it was really helpful in sorting out the most
> promising direction.
>
>
The multi-Get does not help? The downside is that one slow server slows the
whole query. Is it not sufficiently parallel in its querying?
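
For reference, the batched multi-Get is Table#get(List<Get>) in the HBase Java client; here is a sketch of the chunking around it (the actual RPC is only indicated in comments, since it needs a live cluster):

```java
import java.util.*;

public class Batches {
    // Partition row keys into fixed-size batches; each batch would then be
    // resolved in one call via org.apache.hadoop.hbase.client.Table#get(List<Get>).
    static List<List<String>> partition(List<String> keys, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                    keys.subList(i, Math.min(i + batchSize, keys.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<List<String>> b =
                partition(Arrays.asList("k1", "k2", "k3", "k4", "k5"), 2);
        System.out.println(b.size()); // 3
        // For each batch: build a List<Get> and call table.get(gets). The
        // client fans the batch out to the involved region servers, but the
        // call returns only when the slowest of them has answered.
    }
}
```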



> Currently I am experimenting with simply parallelizing lookups in chunks
> from the client. That reduces elapsed wait time a bit. It seems, though,
> that avoiding round trips altogether by "scanning in parallel server-side"
> should yield much better improvements.
>


How would this work? You'd pass over a list of GUIDs you knew were on a
particular server, then in a coprocessor, we'd do whatever per GUID?
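
A sketch of the grouping step such a scheme would need, with a stub locator function; against a real cluster the location could come from RegionLocator#getRegionLocation(rowKey) instead:

```java
import java.util.*;
import java.util.function.Function;

public class GroupByServer {
    // Group row keys by the server hosting them, so each server receives one
    // request carrying only its own GUIDs. The locator is a stub here; in a
    // real client it could wrap
    // RegionLocator#getRegionLocation(rowKey).getServerName().
    static Map<String, List<String>> groupByServer(
            List<String> guids, Function<String, String> locator) {
        Map<String, List<String>> byServer = new TreeMap<>();
        for (String g : guids) {
            byServer.computeIfAbsent(locator.apply(g), s -> new ArrayList<>()).add(g);
        }
        return byServer;
    }

    public static void main(String[] args) {
        // Stub locator: pretend the keys are spread over two servers by hash.
        Function<String, String> locator =
                g -> "server-" + (Math.abs(g.hashCode()) % 2);
        Map<String, List<String>> groups =
                groupByServer(Arrays.asList("g1", "g2", "g3", "g4"), locator);
        System.out.println(groups.size() <= 2); // true
    }
}
```

The per-server request could then be handled by a coprocessor endpoint that runs the per-GUID scans locally, saving one round trip per GUID.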

St.Ack



> Thanks,
> Henning
>
> --
> Henning Blohm
>
> *ZFabrik Software GmbH & Co. KG*
>
> T:      +49 6227 3984255
> F:      +49 6227 3984254
> M:      +49 1781891820
>
> Lammstrasse 2 69190 Walldorf
>
> henning.blohm@zfabrik.de <ma...@zfabrik.de>
> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
> ZFabrik <http://www.zfabrik.de>
> Blog <http://www.z2-environment.net/blog>
> Z2-Environment <http://www.z2-environment.eu>
> Z2 Wiki <http://redmine.z2-environment.net>
>
>

Re: parallel scanning?

Posted by Ted Yu <yu...@gmail.com>.
bq. we can write twice/multi-time with no problem

If you always write twice, the latency goes up. Yet there is no
guarantee that both writes will succeed.

On Fri, Feb 5, 2016 at 9:48 PM, Jameson Li <ho...@gmail.com> wrote:

> ''By line, did you mean number of rows ?
>
> Yes, sorry for my poor English.
>
> ''In the above case, handling a failed write (to the second table) becomes a
> bit tricky.
>
> Yes. But I think the write problem can sometimes be solved more easily than
> the read one, and sometimes we can write twice/multiple times with no problem
> (provided we do not set column timestamps ourselves).
>
>
>
>
> 2016-02-05 20:13 GMT+08:00 Ted Yu <yu...@gmail.com>:
>
> > bq. when the result line is so much lines
> >
> > By line, did you mean number of rows ?
> >
> > bq. one table with rowkey as A_B_time, another as B_A_time
> >
> > In the above case, handling a failed write (to the second table) becomes
> > a bit tricky.
> >
> > Cheers
> >
> > On Fri, Feb 5, 2016 at 12:08 AM, Jameson Li <ho...@gmail.com> wrote:
> >
> > > 2016-01-26 2:29 GMT+08:00 Henning Blohm <he...@zfabrik.de>:
> > >
> > > > I am looking for advice on an HBase mass data access optimization
> > > > problem.
> > > >
> > >
> > > For multi-get and multi-scan:
> > > In my opinion, multi-get (fewer rows) can work for real-time queries.
> > > Multi-scan may also work, but it easily keeps the servers busy and can
> > > turn other small queries into big, slow ones. Multi-get's query time is
> > > not stable either: when one of the regions is busy, the overall time
> > > goes up.
> > >
> > > For real-time and offline:
> > > Look at your actual query results. When the result is so many lines,
> > > like a megabyte or 10 megabytes, its query time will not be as good as
> > > milliseconds, because of the network transfer time. We must reduce the
> > > number of result lines, the result size, or the result columns;
> > > otherwise it is not suited to true real-time querying.
> > > If you actually need that many queries with such big results, I suggest
> > > working offline and in parallel, not in real time, because the server
> > > network throughput will not keep up either (with a 1 Gbit NIC and 2 MB
> > > per query, one server can handle only about 50 qps).
> > >
> > > If it is just the query issue (multi-scan vs. multi-get), I think we
> > > can spend extra storage to improve query performance, using an extra
> > > table (written twice) with another schema. E.g., one table with rowkey
> > > A_B_time, another as B_A_time: when querying B%, we query only the
> > > B_A_time table with one small scan, and do not need to query the
> > > A_B_time table with multi-scans.
> > >
> > > Hope this helps.
> > >
> > >
> > >
> > >
> > > --
> > >
> > >
> > > Thanks & Regards,
> > > 李剑 Jameson Li
> > > Focus on Hadoop,Mysql
> > >
> >
>
>
>
> --
>
>
> Thanks & Regards,
> 李剑 Jameson Li
> Focus on Hadoop,Mysql
>

Re: parallel scanning?

Posted by Jameson Li <ho...@gmail.com>.
''By line, did you mean number of rows ?

Yes, sorry for my poor English.

''In the above case, handling a failed write (to the second table) becomes a
bit tricky.

Yes. But I think the write problem can sometimes be solved more easily than
the read one, and sometimes we can write twice/multiple times with no problem
(provided we do not set column timestamps ourselves).




2016-02-05 20:13 GMT+08:00 Ted Yu <yu...@gmail.com>:

> bq. when the result line is so much lines
>
> By line, did you mean number of rows ?
>
> bq. one table with rowkey as A_B_time, another as B_A_time
>
> In the above case, handling a failed write (to the second table) becomes a
> bit tricky.
>
> Cheers
>
> On Fri, Feb 5, 2016 at 12:08 AM, Jameson Li <ho...@gmail.com> wrote:
>
> > 2016-01-26 2:29 GMT+08:00 Henning Blohm <he...@zfabrik.de>:
> >
> > > I am looking for advice on an HBase mass data access optimization
> > > problem.
> > >
> >
> > For multi-get and multi-scan:
> > In my opinion, multi-get (fewer rows) can work for real-time queries.
> > Multi-scan may also work, but it easily keeps the servers busy and can
> > turn other small queries into big, slow ones. Multi-get's query time is
> > not stable either: when one of the regions is busy, the overall time
> > goes up.
> >
> > For real-time and offline:
> > Look at your actual query results. When the result is so many lines, like
> > a megabyte or 10 megabytes, its query time will not be as good as
> > milliseconds, because of the network transfer time. We must reduce the
> > number of result lines, the result size, or the result columns; otherwise
> > it is not suited to true real-time querying.
> > If you actually need that many queries with such big results, I suggest
> > working offline and in parallel, not in real time, because the server
> > network throughput will not keep up either (with a 1 Gbit NIC and 2 MB
> > per query, one server can handle only about 50 qps).
> >
> > If it is just the query issue (multi-scan vs. multi-get), I think we can
> > spend extra storage to improve query performance, using an extra table
> > (written twice) with another schema. E.g., one table with rowkey
> > A_B_time, another as B_A_time: when querying B%, we query only the
> > B_A_time table with one small scan, and do not need to query the
> > A_B_time table with multi-scans.
> >
> > Hope this helps.
> >
> >
> >
> >
> > --
> >
> >
> > Thanks & Regards,
> > 李剑 Jameson Li
> > Focus on Hadoop,Mysql
> >
>



-- 


Thanks & Regards,
李剑 Jameson Li
Focus on Hadoop,Mysql

Re: parallel scanning?

Posted by Ted Yu <yu...@gmail.com>.
bq. when the result line is so much lines

By line, did you mean number of rows ?

bq. one table with rowkey as A_B_time, another as B_A_time

In the above case, handling a failed write (to the second table) becomes a
bit tricky.

Cheers

On Fri, Feb 5, 2016 at 12:08 AM, Jameson Li <ho...@gmail.com> wrote:

> 2016-01-26 2:29 GMT+08:00 Henning Blohm <he...@zfabrik.de>:
>
> > I am looking for advice on an HBase mass data access optimization
> > problem.
> >
>
> For multi-get and multi-scan:
> In my opinion, multi-get (fewer rows) can work for real-time queries.
> Multi-scan may also work, but it easily keeps the servers busy and can turn
> other small queries into big, slow ones. Multi-get's query time is not
> stable either: when one of the regions is busy, the overall time goes up.
>
> For real-time and offline:
> Look at your actual query results. When the result is so many lines, like a
> megabyte or 10 megabytes, its query time will not be as good as
> milliseconds, because of the network transfer time. We must reduce the
> number of result lines, the result size, or the result columns; otherwise
> it is not suited to true real-time querying.
> If you actually need that many queries with such big results, I suggest
> working offline and in parallel, not in real time, because the server
> network throughput will not keep up either (with a 1 Gbit NIC and 2 MB per
> query, one server can handle only about 50 qps).
>
> If it is just the query issue (multi-scan vs. multi-get), I think we can
> spend extra storage to improve query performance, using an extra table
> (written twice) with another schema. E.g., one table with rowkey A_B_time,
> another as B_A_time: when querying B%, we query only the B_A_time table
> with one small scan, and do not need to query the A_B_time table with
> multi-scans.
>
> Hope this helps.
>
>
>
>
> --
>
>
> Thanks & Regards,
> 李剑 Jameson Li
> Focus on Hadoop,Mysql
>

Re: parallel scanning?

Posted by Jameson Li <ho...@gmail.com>.
2016-01-26 2:29 GMT+08:00 Henning Blohm <he...@zfabrik.de>:

> I am looking for advice on an HBase mass data access optimization problem.
>

For multi-get and multi-scan:
In my opinion, multi-get (fewer rows) can work for real-time queries.
Multi-scan may also work, but it easily keeps the servers busy and can turn
other small queries into big, slow ones. Multi-get's query time is not stable
either: when one of the regions is busy, the overall time goes up.

For real-time and offline:
Look at your actual query results. When the result is so many lines, like a
megabyte or 10 megabytes, its query time will not be as good as milliseconds,
because of the network transfer time. We must reduce the number of result
lines, the result size, or the result columns; otherwise it is not suited to
true real-time querying.
If you actually need that many queries with such big results, I suggest
working offline and in parallel, not in real time, because the server network
throughput will not keep up either (with a 1 Gbit NIC and 2 MB per query, one
server can handle only about 50 qps).

If it is just the query issue (multi-scan vs. multi-get), I think we can
spend extra storage to improve query performance, using an extra table
(written twice) with another schema. E.g., one table with rowkey A_B_time,
another as B_A_time: when querying B%, we query only the B_A_time table with
one small scan, and do not need to query the A_B_time table with multi-scans.

Hope this helps.
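
The dual-key idea above can be sketched as building both row keys from the same record (the field names and key layout here are illustrative, not a prescribed schema):

```java
public class DualKeys {
    // Build both row keys for the same record: the primary table keyed
    // A_B_time and the extra table keyed B_A_time. Querying by B then needs
    // only one small prefix scan on the second table.
    static String abKey(String a, String b, long time) {
        return a + "_" + b + "_" + time;
    }

    static String baKey(String a, String b, long time) {
        return b + "_" + a + "_" + time;
    }

    public static void main(String[] args) {
        String a = "user1", b = "order9";
        long t = 1454659200000L;
        // On write, put the record under both keys (two tables, two puts).
        System.out.println(abKey(a, b, t)); // user1_order9_1454659200000
        System.out.println(baKey(a, b, t)); // order9_user1_1454659200000
        // A query for b = "order9" becomes a single prefix scan for
        // "order9_" on the B_A_time table instead of multi-scans on A_B_time.
        System.out.println(baKey(a, b, t).startsWith(b + "_")); // true
    }
}
```

As Ted notes elsewhere in the thread, the cost of this scheme is the double write and the need to handle a failed write to the second table.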




-- 


Thanks & Regards,
李剑 Jameson Li
Focus on Hadoop,Mysql