Posted to dev@hbase.apache.org by "张铎(Duo Zhang)" <pa...@gmail.com> on 2022/06/12 13:25:30 UTC

[DISCUSS] Move replication queue storage from zookeeper to a separate HBase table

The issue for this is HBASE-27109[1], a sub-task of HBASE-15867[2], where we want
to remove the dependency on zk from the replication implementation. Once
HBASE-15867 is done, there will be no permanent state on zk any more, which means
it is always safe to rebuild a cluster with a fresh zk instance.

The related issues were opened long ago, such as HBASE-10295[3], HBASE-13773[4],
etc. HBASE-15867 nearly solved the problem, as we have already abstracted a
replication peer storage interface and a replication queue storage interface; the
idea was that two table-based storage implementations would finish the job. But
then we found out there is still a cyclic dependency which could fail the startup
of a cluster. In the current replication implementation, once we create a new WAL
writer, we need to record it in the replication queue storage before writing data
to it. But if we move the replication queue storage to an HBase table, then that
table must be writable before we can record the new WAL file in it. On a new
cluster, this will hang the cluster start up, as besides hbase:meta, no region can
be online...
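
To make the cyclic dependency concrete, here is a minimal, hypothetical sketch of
what a table-backed queue storage would have to do when a new WAL is rolled. The
interface shape, method names and column layout below are illustrative assumptions
only, not the actual HBase code:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: a queue storage that records WAL files in an hbase:replication table.
interface ReplicationQueueStorage {
  // Must succeed for a newly rolled WAL before the region server writes data to it.
  void addWAL(String serverName, String peerId, String walName) throws IOException;
}

class TableBasedReplicationQueueStorage implements ReplicationQueueStorage {
  private final Connection conn; // connection back into the same cluster

  TableBasedReplicationQueueStorage(Connection conn) {
    this.conn = conn;
  }

  @Override
  public void addWAL(String serverName, String peerId, String walName) throws IOException {
    // This Put can only succeed once a region of hbase:replication is online and writable.
    try (Table table = conn.getTable(TableName.valueOf("hbase:replication"))) {
      table.put(new Put(Bytes.toBytes(serverName + "-" + peerId))
          .addColumn(Bytes.toBytes("q"), Bytes.toBytes(walName), Bytes.toBytes(0L)));
    }
  }
}

// On a brand new cluster, opening any region (including the regions of
// hbase:replication itself) needs a region server that can roll a WAL, rolling a
// WAL calls addWAL(), and addWAL() needs hbase:replication to be online already,
// so the start up hangs.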

In HBASE-27109, I proposed a new way to track the WAL files. Please see the
design doc[5] for more details. You may find out that the implementation of
claim queues and replication log cleaner become more complicated. This is a
trade off, if we want to make the life when writing and tracking WAL
easier, then we need to deal with the complexity in other places. But I
think it is worthwhile as writing WAL is on the critical path of our main
read/write flow, where claim queues and replication log cleaner are both
background tasks.

Feel free to reply here, on the jira issue, or on the design doc.
Suggestions are always welcome.

1. https://issues.apache.org/jira/browse/HBASE-27109
2. https://issues.apache.org/jira/browse/HBASE-15867
3. https://issues.apache.org/jira/browse/HBASE-10295
4. https://issues.apache.org/jira/browse/HBASE-13773
5. https://docs.google.com/document/d/1QrSFlDQblxc12aTomE64sVmghrs_g5ys4fU9wGOdMHk/edit?usp=sharing

Re: [DISCUSS] Move replication queue storage from zookeeper to a separate HBase table

Posted by Nick Dimiduk <nd...@apache.org>.
Thanks for the update here and meeting minutes.

-n


Re: [DISCUSS] Move replication queue storage from zookeeper to a separate HBase table

Posted by "张铎(Duo Zhang)" <pa...@gmail.com>.
Meeting notes:
Attendees: Duo Zhang, Liangjun He, Xin Sun, Tianhang Tang, Yu Li

First Duo explained the design doc again, then the others asked some questions
and we discussed them. Let me post the conclusions here:

1. If we rely on an HBase table to store the replication metadata, then how do
we use the replication sync-up tool to replicate data to the peer cluster once
the source cluster is fully down?
We agree that this is a limitation compared to the old zookeeper-based
implementation. Maybe we could mirror the replication metadata to another storage
system, or use maintenance mode to bring the hbase:replication table online? Not
a blocker, but at least we need to clearly document this.

2. Since zookeeper is no longer used, the load that was on zookeeper will now
move to HBase and HDFS. Will it cause too much pressure and fail the cluster in
extreme cases?
After discussion, we mostly agree the risk is low. The heaviest operation is
claiming queues, where we need to list HDFS, but that is the last step of SCP,
after WAL splitting has already finished, and listing only touches the namenode,
so in general it will not add too much pressure. Anyway, when implementing, we
need to be careful to avoid touching HDFS too much.

3. If hbase:replication is offline, will it hang replication?
Yes, that is by design, but we should try our best not to hang normal reads and
writes when hbase:replication is offline.

4. Does the sourceServerName in ReplicationQueueId mean the last region server
which held the replication queue?
No, it is the FIRST region server which held the replication queue. The old
design tracked every region server which had held the replication queue in the
queue id, but actually we only need the first region server for locating the WAL
files (a small illustrative sketch follows after this list).

5. How do we predict the pressure on the new hbase:replication table?
For a normal cluster, most of the pressure comes from updating the replication
offsets. It can be estimated easily as write_throughput /
replication_size_per_offset_update. Of course, the qps doubles if the number of
replication peers doubles (a rough worked example also follows after this list).
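
To make point 4 concrete, here is a minimal, hypothetical sketch of what such a
queue id could look like; the class shape and method names are illustrative
assumptions, not the actual HBASE-27109 code:

import org.apache.hadoop.hbase.ServerName;

// Sketch only: a queue id that carries the FIRST holder of the queue plus the peer.
final class ReplicationQueueId {
  private final ServerName sourceServerName; // the FIRST region server that held this queue
  private final String peerId;               // the replication peer the queue belongs to

  ReplicationQueueId(ServerName sourceServerName, String peerId) {
    this.sourceServerName = sourceServerName;
    this.peerId = peerId;
  }

  // The first holder is enough to locate the WAL files, since WAL file names embed
  // the server that created them; there is no need to append every region server
  // that later claims the queue, as the old queue ids did.
  ServerName getSourceServerName() {
    return sourceServerName;
  }

  String getPeerId() {
    return peerId;
  }
}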
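
And to make the arithmetic in point 5 concrete, a rough back-of-the-envelope
example with assumed numbers (100 MB/s of WAL writes per region server, one offset
persist per 64 MB shipped, two peers):

public class ReplicationTablePressureEstimate {
  public static void main(String[] args) {
    // All numbers below are assumptions for illustration, not measurements.
    double walWriteThroughputMBPerSec = 100.0; // WAL bytes written per second on one region server
    double sizePerOffsetUpdateMB = 64.0;       // bytes shipped between two offset persists
    int replicationPeers = 2;                  // each peer persists its own offset

    // write_throughput / replication_size_per_offset_update, multiplied by the peer count:
    double offsetUpdateQps = walWriteThroughputMBPerSec / sizePerOffsetUpdateMB * replicationPeers;

    // Prints ~3.1 offset updates per second against hbase:replication for these numbers.
    System.out.printf("~%.1f offset updates/sec from this region server%n", offsetUpdateQps);
  }
}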

Later we talked about general problems with replication. For example, if we have
20~30 replication peers, not only does the pressure on the replication metadata
become a problem, the pressure of reading HDFS becomes a big problem too. We
discussed several possible solutions, such as having only one thread read the WAL
files instead of one thread per peer, caching the newest few WAL files in memory,
or having only one replication peer mirror all the WAL data to kafka and then
using kafka to replicate to the other systems, etc. Anyway, this is not related
to the main topic.

And we all agree that the current design doc is huge and there are still lots of
details in each area. We will open sub-tasks to cover the different areas, and
also split the design doc into several pieces and keep polishing them.

Thanks.



Re: [DISCUSS] Move replication queue storage from zookeeper to a separate HBase table

Posted by "张铎(Duo Zhang)" <pa...@gmail.com>.
We plan to hold an online meeting from 2 PM to 3 PM on 1st July, GMT+8, using
Tencent Meeting.

阿米朵 invites you to a Tencent Meeting
> Meeting topic: HBase Replication Queue Storage
> Meeting time: 2022/07/01 14:00-15:00 (GMT+08:00) China Standard Time - Beijing
>
> Click this url to join the meeting, or add it to your meeting list:
> https://meeting.tencent.com/dm/kZQdGasowxXP
>
> #Tencent Meeting ID: 430-524-288 <---- This is the number of the meeting
> Meeting password: 220701 <---- This is the password
>
> One-tap dial-in from a mobile phone
> +8675536550000,,430524288# (Mainland China)
> +85230018898,,,2,430524288# (Hong Kong, China)
>
> Dial by your location
> +8675536550000 (Mainland China)
> +85230018898 (Hong Kong, China)
>
> Copy this information and open the Tencent Meeting mobile app to join
>

More attendees are always welcome :)



Re: [DISCUSS] Move replication queue storage from zookeeper to a separate HBase table

Posted by "张铎(Duo Zhang)" <pa...@gmail.com>.
Liangjun He replied on jira that he wants to join the work.

We plan to schedule an online meeting soon to discuss it.

Will post the meeting schedule here when we find a suitable time.

Feel free to join if you are interested.

Thanks.


Re: [DISCUSS] Move replication queue storage from zookeeper to a separate HBase table

Posted by "张铎(Duo Zhang)" <pa...@gmail.com>.
Thanks Andrew for the hard work on closing stale issues, and let me bump
this thread...
