Posted to dev@hbase.apache.org by libis <li...@gmail.com> on 2017/09/04 07:06:06 UTC

Should we split the scan range into several segments when the scan range is located in only a single region?

Hi

When TableInputFormat is used to source an HBase table in a MapReduce job,
its splitter creates one map task for each region of the table. However, in
some cases the user’s scan range may fall within a single region, so the
job runs with only one mapper. For example, if the rowkey of the table is
‘md5(userid) + timestamp’ and a client wants to scan the last month of data
for a specific user with MR, it is very likely that only one mapper will do
the work.

To scan data in parallel when the user's scan range lies within a single
region, should we split the scan range into several segments within that
region?
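
As a rough illustration of the idea (this is not HBase code), the sketch below divides a bounded row-key range into roughly equal segments by treating the keys as big-endian integers. The function name `split_key_range`, the `extra_bytes` parameter, and the zero-padding of keys are assumptions made for this example only. HBase treats an empty stop row as unbounded, which this sketch does not handle.

```python
def split_key_range(start: bytes, stop: bytes, n: int,
                    extra_bytes: int = 0) -> list[tuple[bytes, bytes]]:
    """Split the half-open row-key range [start, stop) into at most n segments.

    Hypothetical helper for illustration; not part of TableInputFormat.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not start < stop:
        raise ValueError("start must sort before stop; "
                         "an empty (unbounded) stop row is not supported")

    # Pad both keys with zero bytes to a common width so that integer
    # order matches the lexicographic byte order HBase uses for row keys.
    # extra_bytes widens the keys to allow finer-grained boundaries.
    width = max(len(start), len(stop)) + extra_bytes
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(stop.ljust(width, b"\x00"), "big")

    # Never ask for more segments than there are distinct keys at this width.
    n = max(1, min(n, hi - lo))

    # Evenly spaced interior boundaries, converted back to row keys.
    inner = [(lo + (hi - lo) * i // n).to_bytes(width, "big")
             for i in range(1, n)]

    # Keep the caller's exact start and stop keys at the two ends.
    bounds = [start, *inner, stop]
    return list(zip(bounds, bounds[1:]))


segments = split_key_range(b"\x10", b"\x20", 4)
for seg_start, seg_stop in segments:
    print(seg_start.hex(), seg_stop.hex())
```

Each resulting (start, stop) pair could then back one input split, so that several mappers scan the same region in parallel. Equal-width key segments do not guarantee equal amounts of data per segment; that depends on how the keys are distributed.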

Best,

xinxin

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by libis <li...@gmail.com>.

OK, I am now watching the JIRA issue.

2017-09-05 15:22 GMT+08:00 Chia-Ping Tsai <ch...@apache.org>:

> Yeah, HBASE-16894 is a similar one. Maybe Yi Liang is still working on
> it. Let's move this discussion to the JIRA.

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by Chia-Ping Tsai <ch...@apache.org>.

Yeah, HBASE-16894 is a similar one. Maybe Yi Liang is still working on it. Let's move this discussion to the JIRA.

On 2017-09-05 09:53, libis <li...@gmail.com> wrote: 
> Thanks, Mikhail. I am happy to pick up HBASE-18090 (my JIRA account
> is xinxin fan). I notice that HBASE-16894
> (https://issues.apache.org/jira/browse/HBASE-16894) tries to address a
> similar problem. Chia-Ping, could you take a look at it?

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by libis <li...@gmail.com>.

Thanks, Mikhail. I am happy to pick up HBASE-18090 (my JIRA account
is xinxin fan). I notice that HBASE-16894
(https://issues.apache.org/jira/browse/HBASE-16894) tries to address a
similar problem. Chia-Ping, could you take a look at it?

2017-09-04 20:41 GMT+08:00 Chia-Ping Tsai <ch...@apache.org>:

> Thanks for the information, Mikhail. It seems to me this is a popular issue.
> libis, could you take over HBASE-18090? I can assign the issue to you if
> you give me your JIRA account.

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by Chia-Ping Tsai <ch...@apache.org>.

Thanks for the information, Mikhail. It seems to me this is a popular issue.
libis, could you take over HBASE-18090? I can assign the issue to you if you give me your JIRA account.

On 2017-09-04 20:26, Mikhail Antonov <ol...@gmail.com> wrote: 
> I filed https://issues.apache.org/jira/browse/HBASE-18090 some time ago
> and attached a draft patch to it. It's not complete, as we need some deeper
> changes in the way we open regions (see the comments), but the basic stuff
> works. I ended up going another route and didn't have the bandwidth to
> finish it; it would be great if someone picked it up.
> 
> Mikhail
> -- 
> Thanks,
> Michael Antonov
>

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by Mikhail Antonov <ol...@gmail.com>.

I filed https://issues.apache.org/jira/browse/HBASE-18090 some time ago
and attached a draft patch to it. It's not complete, as we need some deeper
changes in the way we open regions (see the comments), but the basic stuff
works. I ended up going another route and didn't have the bandwidth to
finish it; it would be great if someone picked it up.

Mikhail

On Mon, Sep 4, 2017 at 11:13 AM Chia-Ping Tsai <ch...@apache.org> wrote:

> That sounds good. There are some related issues; see
> https://issues.apache.org/jira/browse/HBASE-4914 and
> https://issues.apache.org/jira/browse/HBASE-4063.
-- 
Thanks,
Michael Antonov

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by libis <li...@gmail.com>.

Thanks for replying promptly. I think it may be hard for an HBase user to
set a proper number of mappers per region, and with that approach small
regions may create too many small jobs. However, we could simply specify a
fixed number of mappers only when the scan range is located in a single
region, which may be a common production scenario for large regions
(>30 GB). What do you think?
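
A minimal sketch of that policy, assuming regions are given as (start_key, end_key) pairs where an empty key means unbounded, as in HBase. The names `overlaps`, `plan_mappers` and `fixed_mappers` are hypothetical and not part of TableInputFormat or any existing HBase configuration.

```python
Region = tuple[bytes, bytes]  # (start_key, end_key); b"" means unbounded


def overlaps(region: Region, scan_start: bytes, scan_stop: bytes) -> bool:
    """Return True if the half-open scan range intersects the region."""
    region_start, region_end = region
    # An empty scan_stop or region_end means "to the end of the table".
    # An empty start key needs no special case: b"" sorts before every key.
    starts_before_scan_ends = scan_stop == b"" or region_start < scan_stop
    ends_after_scan_starts = region_end == b"" or scan_start < region_end
    return starts_before_scan_ends and ends_after_scan_starts


def plan_mappers(regions: list[Region], scan_start: bytes, scan_stop: bytes,
                 fixed_mappers: int = 4) -> dict[Region, int]:
    """Decide how many mappers each region touched by the scan should get."""
    hit = [r for r in regions if overlaps(r, scan_start, scan_stop)]
    if len(hit) == 1:
        # The whole scan sits in one region: use the fixed mapper count.
        return {hit[0]: fixed_mappers}
    # Otherwise keep the existing behaviour of one mapper per region.
    return {r: 1 for r in hit}


regions = [(b"", b"g"), (b"g", b"p"), (b"p", b"")]
print(plan_mappers(regions, b"h", b"k"))   # scan inside the middle region
print(plan_mappers(regions, b"a", b"z"))   # scan spanning all regions
```

With this rule the user sets only one number, and small regions in a multi-region scan are never broken into extra tasks.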

2017-09-04 17:13 GMT+08:00 Chia-Ping Tsai <ch...@apache.org>:

> That sounds good. There are some related issues; see
> https://issues.apache.org/jira/browse/HBASE-4914 and
> https://issues.apache.org/jira/browse/HBASE-4063.

Re: Should we split the scan range into several segments when the scan range is located in only a single region?

Posted by Chia-Ping Tsai <ch...@apache.org>.

That sounds good. There are some related issues; see https://issues.apache.org/jira/browse/HBASE-4914 and https://issues.apache.org/jira/browse/HBASE-4063.

On 2017-09-04 15:06, libis <li...@gmail.com> wrote: 
> Hi
> 
> When TableInputFormat is used to source an HBase table in a MapReduce job,
> its splitter creates one map task for each region of the table. However, in
> some cases the user’s scan range may fall within a single region, so the
> job runs with only one mapper. For example, if the rowkey of the table is
> ‘md5(userid) + timestamp’ and a client wants to scan the last month of data
> for a specific user with MR, it is very likely that only one mapper will do
> the work.
> 
> To scan data in parallel when the user's scan range lies within a single
> region, should we split the scan range into several segments within that
> region?
> 
> Best,
> 
> xinxin
>