Posted to dev@hudi.apache.org by Jian Feng <ji...@shopee.com> on 2021/10/04 15:28:41 UTC

Is there a solution to the HBase data skew issue

When I bootstrapped a huge HBase index table, I found that all keys share the
prefix 'itemid:', which caused data skew: there are 100 region servers in
HBase, but only one was handling the data.

Is there any way to avoid this issue on the Hudi side?
-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure

Re: Is there a solution to the HBase data skew issue

Posted by Vinoth Chandar <vi...@apache.org>.

Hi,

Hudi async compaction in general can deal with concurrent writes to the
same file group. This is done by the writer respecting all pending
compaction plans.
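
As a rough illustration of what "respecting pending compaction plans" can
mean, here is a toy model, not Hudi code. It assumes the behavior as commonly
described for merge-on-read tables: scheduling a compaction freezes the
current file slice and opens a new slice keyed by the compaction instant, and
later delta commits log against that new slice. All class and function names
are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    base_instant: str                              # instant the slice is keyed by
    log_files: list = field(default_factory=list)  # delta logs appended so far


@dataclass
class FileGroup:
    file_id: str
    slices: list = field(default_factory=list)     # ordered oldest to newest

    def latest_slice(self) -> FileSlice:
        return self.slices[-1]


def schedule_compaction(group: FileGroup, compaction_instant: str) -> FileSlice:
    """Freeze the current latest slice for the plan and open a new slice.

    The returned slice is what the compactor reads; writers no longer touch it.
    """
    frozen = group.latest_slice()
    group.slices.append(FileSlice(base_instant=compaction_instant))
    return frozen


def append_delta(group: FileGroup, commit_instant: str) -> FileSlice:
    """Writer path: always log against the newest slice of the file group."""
    target = group.latest_slice()
    target.log_files.append(f"{group.file_id}_{commit_instant}.log")
    return target


group = FileGroup("fg-1", [FileSlice("001")])
append_delta(group, "002")
planned = schedule_compaction(group, "003")   # plan is pending, not yet executed
written_to = append_delta(group, "004")       # delta commit arriving meanwhile
```

In this model the compactor and the writer never share a slice: the plan
covers only the log written at "002", while the commit at "004" lands in the
slice opened at "003".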

Thanks
Vinoth

On Sun, Oct 17, 2021 at 8:39 AM Jian Feng <ji...@shopee.com> wrote:

> I have a question: when DeltaStreamer does a delta commit with the Bloom
> index and the data is new, it may need to append it to an existing file
> group. That may cause a concurrency issue with the async compaction thread
> if the compaction plan contains the same file group. How does Hudi avoid
> that?

Re: Is there a solution to the HBase data skew issue

Posted by Jian Feng <ji...@shopee.com>.

I have a question: when DeltaStreamer does a delta commit with the Bloom
index and the data is new, it may need to append it to an existing file
group. That may cause a concurrency issue with the async compaction thread
if the compaction plan contains the same file group. How does Hudi avoid
that?

On Fri, Oct 15, 2021 at 12:50 AM Vinoth Chandar <vi...@apache.org> wrote:

> Yeah, all the rate-limiting code in HBaseIndex is there to work around
> these large bulk writes.


-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure

Re: Is there a solution to the HBase data skew issue

Posted by Vinoth Chandar <vi...@apache.org>.

Yeah, all the rate-limiting code in HBaseIndex is there to work around
these large bulk writes.

On Tue, Oct 5, 2021 at 11:16 AM Jian Feng <ji...@shopee.com> wrote:

> Actually, I ran into this problem when bootstrapping a huge table. After I
> changed the region key split strategy, the problem was solved.
> I'm glad to hear that the HFile solution will work in the future. Since the
> Bloom index cannot index MOR log files, newly inserted data is still
> written into Parquet, which is why I chose the HBase index: it gets better
> performance.

Re: Is there a solution to the HBase data skew issue

Posted by Jian Feng <ji...@shopee.com>.

Actually, I ran into this problem when bootstrapping a huge table. After I
changed the region key split strategy, the problem was solved.
I'm glad to hear that the HFile solution will work in the future. Since the
Bloom index cannot index MOR log files, newly inserted data is still written
into Parquet, which is why I chose the HBase index: it gets better
performance.
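
The thread does not say which split strategy was used, so the following is
only a sketch of one possible approach: placing region boundaries inside the
shared 'itemid:' prefix instead of across the whole key space. The split
points and sample ids are made up for illustration; real split points would
need to come from sampling the actual key distribution.

```python
from bisect import bisect_right

PREFIX = "itemid:"  # the shared prefix reported in this thread


def region_for(row_key, split_keys):
    """Return the index of the region whose [start, end) range holds row_key.

    Python compares str by code point; for ASCII keys this matches a
    byte-wise ordering of row keys.
    """
    return bisect_right(split_keys, row_key)


keys = [f"{PREFIX}{n}" for n in (104, 2750, 31, 48812, 907)]

# Split points chosen without looking at the keys: every key starts with "i",
# which sorts after "f", so all of them fall into the last region.
naive_splits = list("123456789abcdef")
naive_regions = {region_for(k, naive_splits) for k in keys}

# Hypothetical prefix-aware split points: boundaries placed inside "itemid:".
aware_splits = [PREFIX + d for d in "123456789"]
aware_regions = [region_for(k, aware_splits) for k in keys]
```

With the naive split points every sample key maps to the same region, which
is the skew described above; with prefix-aware split points the same keys
spread across several regions.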

On Tue, Oct 5, 2021 at 7:29 PM Vinoth Chandar <vi...@apache.org> wrote:

> +1 on that answer. It's pretty spot on.
>
> While a random prefix helps with HBase balancing, the issue then becomes
> that you lose all key ordering inside the Hudi table, which is nice to
> have if you ever want range pruning/indexing to be effective.
>
> To paint a picture of the work being done in this area: this work, driven
> by Uber engineers (https://github.com/apache/hudi/pull/3508), could
> technically solve the issue by reading HFiles directly for indexing,
> avoiding going to the HBase servers. But obviously, it could be less
> performant than HBase for small upsert batches (given that the region
> servers cache data, etc.).
> If your backing storage is a cloud/object store, which again throttles by
> prefix, then we could run into the same hotspotting problem again.
> Otherwise, for larger batches, this would be far more scalable.

Re: Is there a solution to the HBase data skew issue

Posted by Vinoth Chandar <vi...@apache.org>.

+1 on that answer. It's pretty spot on.

While a random prefix helps with HBase balancing, the issue then becomes
that you lose all key ordering inside the Hudi table, which is nice to
have if you ever want range pruning/indexing to be effective.
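
To make the ordering point concrete, here is a small sketch. The salt is a
toy byte-sum checksum chosen only so the result is easy to follow; it is not
what HBase or Hudi uses, and the function name and bucket count are
assumptions.

```python
def toy_salt(key: str, buckets: int = 4) -> str:
    """Prefix the key with a bucket id derived from a simple byte sum."""
    bucket = sum(key.encode("ascii")) % buckets
    return f"{bucket}|{key}"


# Eight consecutive keys, already in key order.
keys = [f"itemid:{n:04d}" for n in range(1, 9)]

# Sort by salted key, as a store ordered on the salted key would.
salted_sorted = sorted(toy_salt(k) for k in keys)
order = [s.split("|", 1)[1] for s in salted_sorted]

# Where the contiguous range itemid:0003..itemid:0005 ends up after salting.
positions = [order.index(k) for k in keys[2:5]]
```

Neighbouring keys end up in different buckets, so a range that was
contiguous before salting is scattered afterwards, and min/max key ranges
per file stop being selective.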

To paint a picture of the work being done in this area: this work, driven
by Uber engineers (https://github.com/apache/hudi/pull/3508), could
technically solve the issue by reading HFiles directly for indexing,
avoiding going to the HBase servers. But obviously, it could be less
performant than HBase for small upsert batches (given that the region
servers cache data, etc.).
If your backing storage is a cloud/object store, which again throttles by
prefix, then we could run into the same hotspotting problem again.
Otherwise, for larger batches, this would be far more scalable.

On Mon, Oct 4, 2021 at 7:06 PM 管梓越 <gu...@bytedance.com> wrote:

> Hi Jianfeng,
>
> As far as I know, there is no solution on the Hudi side yet. However, I
> have run into this problem before, so I hope my experience helps.
>
> As with other uses of HBase, adding a random prefix to the row key is
> probably the most universal solution to this problem.
>
> One option is to change the Hudi primary key by adding such a prefix
> before the data is ingested into Hudi. A new column can be added to keep
> the original primary key for queries, hiding the Hudi primary key.
>
> Another option is a small modification to the HBase index: copy the HBase
> index code and add the prefix wherever it queries and updates HBase. This
> way, the key in HBase differs from the one in Hudi, but the logic stays
> transparent to business logic. I have adopted this method in our
> production environment. The withIndexClass config in IndexConfig can
> specify a custom index, which allows changing the index without
> recompiling the whole Hudi project.

Re: Is there a solution to the HBase data skew issue

Posted by 管梓越 <gu...@bytedance.com>.

Hi Jianfeng,

As far as I know, there is no solution on the Hudi side yet. However, I
have run into this problem before, so I hope my experience helps.

As with other uses of HBase, adding a random prefix to the row key is
probably the most universal solution to this problem.

One option is to change the Hudi primary key by adding such a prefix before
the data is ingested into Hudi. A new column can be added to keep the
original primary key for queries, hiding the Hudi primary key.
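
A minimal sketch of the prefixing idea, under some assumptions: it uses a
hash-derived prefix rather than a truly random one, because an index lookup
has to recompute the same prefix from the key alone. The bucket count, the
"_" separator, and the column names "key" and "original_key" are
hypothetical.

```python
import hashlib

NUM_BUCKETS = 100  # assumption: roughly one bucket per region server


def salt_key(record_key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prepend a two-digit bucket id derived from a hash of the key."""
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{bucket:02d}_{record_key}"


def unsalt_key(salted: str) -> str:
    """Recover the original key by dropping the bucket prefix."""
    return salted.split("_", 1)[1]


def to_hudi_row(row: dict) -> dict:
    """Keep the original key in its own column; the salted key becomes the record key."""
    return {**row, "original_key": row["key"], "key": salt_key(row["key"])}


row = to_hudi_row({"key": "itemid:12345", "price": 10})
```

Because the prefix depends only on the key, the same record always maps to
the same bucket, and queries can keep filtering on the original key column.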

Another option is a small modification to the HBase index: copy the HBase
index code and add the prefix wherever it queries and updates HBase. This
way, the key in HBase differs from the one in Hudi, but the logic stays
transparent to business logic. I have adopted this method in our production
environment. The withIndexClass config in IndexConfig can specify a custom
index, which allows changing the index without recompiling the whole Hudi
project.
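
For reference, a sketch of how such a custom index might be selected through
writer options. To my knowledge the withIndexClass builder setting
corresponds to the property key hoodie.index.class; the class name and table
name below are hypothetical placeholders, and the class would have to be on
the writer's classpath.

```python
# Writer options as they might be passed to a Spark write, for example
# df.write.format("hudi").options(**hudi_options). Only the index class
# entry matters here; everything else about the write is omitted.
hudi_options = {
    "hoodie.table.name": "items",  # hypothetical table name
    # Hypothetical custom class: a copy of the HBase index that adds the prefix
    "hoodie.index.class": "com.example.hudi.SaltedHBaseIndex",
}
```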

On Mon, Oct 4, 2021, 11:29 PM <ji...@shopee.com> wrote:
> When I bootstrapped a huge HBase index table, I found that all keys share
> the prefix 'itemid:', which caused data skew: there are 100 region servers
> in HBase, but only one was handling the data.
>
> Is there any way to avoid this issue on the Hudi side?
> --
> *Jian Feng,冯健*
> Shopee | Engineer | Data Infrastructure