Posted to dev@ozone.apache.org by Kota Uenishi <ko...@preferred.jp> on 2021/11/01 00:29:06 UTC

Re: Design doc to fix HDDS-5905

Thank you for the review, Lokesh and Bharat.

I understand that a transaction ID would be better than a timestamp,
especially because of the computational cost of getting a timestamp. In
this case, the ordering of deletion keys does not have to be strictly
monotonic; mild monotonicity is enough, so even clock skews in the
range of hours or days would be acceptable. I'll update the doc.
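
For illustration only (this is not taken from the design doc): a minimal
Java sketch of why a numeric prefix on a deletion key needs a fixed width
if the table is iterated in lexicographic key order. The key layout, the
class name, and both helper methods are assumptions made up for this
example; they are not Ozone APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustration only; the key layout is an assumption, not the design doc's format. */
final class TxPrefixedDeleteKey {
  /** Fixed-width prefix so that string order matches numeric order (hypothetical helper). */
  static String padded(long txIndex, String keyName) {
    return String.format("%020d/%s", txIndex, keyName);
  }

  /** Naive variant without padding, shown only to illustrate the ordering problem. */
  static String unpadded(long txIndex, String keyName) {
    return txIndex + "/" + keyName;
  }

  public static void main(String[] args) {
    List<String> naive = new ArrayList<>(Arrays.asList(unpadded(9, "a"), unpadded(10, "b")));
    List<String> fixed = new ArrayList<>(Arrays.asList(padded(9, "a"), padded(10, "b")));
    Collections.sort(naive);
    Collections.sort(fixed);
    System.out.println(naive); // transaction 10 sorts before transaction 9
    System.out.println(fixed); // transaction 9 sorts before transaction 10
  }
}
```

With a fixed-width prefix, scanning the table in key order visits entries
in transaction order, which is the "mild monotonicity" mentioned above.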

My question is: is the transaction index always available on a non-HA
cluster? For example, our 1.1.0 cluster uses HA for neither OM nor
SCM, and we are not planning to move even to
single-node Ratis (we are still using
org.apache.hadoop.hdds.scm.pipeline.leader.choose.algorithms.DefaultLeaderChoosePolicy
for ozone.scm.pipeline.leader-choose.policy).

Bharat, on RepeatedKeyInfo:
Yes, in my plan RepeatedKeyInfo is still needed for data format
compatibility, and I'm not planning to change the proto. In particular,
changing the proto format would make upgrade & downgrade extremely
difficult IMO. I know it doesn't have to be a list any more, but that's
only in theory.

On Sat, Oct 30, 2021 at 4:45 AM Bharat Viswanadham
<bv...@cloudera.com.invalid> wrote:
>
> Hi Kota,
> Thanks for taking up HDDS-5905 and quickly coming up with a design.
>
> I liked the overall approach, with one exception: instead of timestamps, I
> agree with Lokesh that we can use the transaction index, which will also
> make the implementation easier. (With timestamps, we would need to propagate
> them from the leader, handle clock skews, and handle leader changes.)
>
> One question: do we still plan to use RepeatedKeyInfo? With this
> change it will no longer be a list. Are you not planning to change the proto?
>
>
> Thanks,
> Bharat
>
>
> On Thu, Oct 28, 2021 at 11:12 PM Lokesh Jain <lj...@apache.org> wrote:
>
> > Hey Kota
> >
> > I really like the proposed approach because it makes sure that blocks are
> > deleted in order of key deletion. I would suggest using the Ratis transaction
> > ID as the prefix. I don’t think we will need a random suffix with that
> > approach, as the transaction ID would avoid any collisions. Further, it
> > avoids the cost of generating timestamps.
> >
> > Thanks
> > Lokesh
> >
> > > On 29-Oct-2021, at 7:52 AM, Kota Uenishi <ko...@preferred.jp> wrote:
> > >
> > > Hi Bharat & devs,
> > >
> > > > I've written up my ideas for fixing HDDS-5905, the
> > > > block-leak issue mentioned by Bharat. It involves a data format
> > > > change in the deletion table, so I want to get a broader range of
> > > > feedback from committers in addition to Bharat. If it looks good to
> > > > you, I want to start writing up a patch. Please take a look!
> > >
> > > > The proposal:
> > > > https://docs.google.com/document/d/1KeyhiE1i5SqRSgLy-pIOGW9X6mUYb8iYEkEoDAEQD9Q/edit#
> > > HDDS-5905: https://issues.apache.org/jira/browse/HDDS-5905
> > >
> > > --
> > > --
> > > Kota UENISHI, Engineer
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@ozone.apache.org
> > > For additional commands, e-mail: dev-help@ozone.apache.org
> > >



-- 
--
Kota UENISHI, Engineer



Re: Design doc to fix HDDS-5905

Posted by Kota Uenishi <ko...@preferred.jp>.
Important correction to my previous message: the sentence about the max
value of long should read *Thus, the max value of long does not have the
first bit as 1.*




-- 
--
Kota UENISHI, Engineer



Re: Design doc to fix HDDS-5905

Posted by Kota Uenishi <ko...@preferred.jp>.
Hi Bharat,

Thank you for suggesting the object ID. I understand that, by design,
the object ID is better suited to the delete table use case, given
the monotonicity requirement. I took a quick look at HDDS-4315 and
have one question.

Looking at the code, the object ID seems to use its two most significant
bits as the epoch ID. But it is mostly handled as Java's primitive
long, which is a signed integer. Thus, the max value of long
does not have the first bit as 1. That means object IDs in epochs 2 and
3 would be negative as longs, and in that case monotonicity under
integer comparison is broken. I doubt that comparing object IDs is
safe. The comparison would only be safe after encoding the IDs as
binary or as unsigned hex, but that is not straightforward and the
comparison could be buggy IMO.
If the epoch is only supposed to range from 0 to 1, it would be safe.
Can we assume that, or is the comparison always supposed to be safe?
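
To make the concern concrete, here is a minimal Java sketch. It assumes
the layout described above (epoch in the top 2 bits, a counter in the
remaining 62); the class and method names are made up for illustration
and are not Ozone APIs.

```java
/** Sketch only: the 2-bit epoch / 62-bit counter layout is an assumption. */
final class ObjectIdOrder {
  private static final int EPOCH_SHIFT = 62;

  /** Packs a 2-bit epoch into the top bits of a long (hypothetical helper). */
  static long objectId(long epoch, long counter) {
    return (epoch << EPOCH_SHIFT) | counter;
  }

  /** Fixed-width hex of the raw 64 bits; string order equals unsigned numeric order. */
  static String toSortableHex(long id) {
    return String.format("%016x", id);
  }

  public static void main(String[] args) {
    long epoch1 = objectId(1, 100);
    long epoch2 = objectId(2, 1);

    // The top bit of Long.MAX_VALUE is 0, so an epoch-2 ID is negative as a long.
    System.out.println(epoch2 < 0);

    // Signed comparison puts the epoch-2 ID before the epoch-1 ID.
    System.out.println(Long.compare(epoch1, epoch2) > 0);

    // Unsigned comparison and fixed-width hex both keep epoch order.
    System.out.println(Long.compareUnsigned(epoch1, epoch2) < 0);
    System.out.println(toSortableHex(epoch1).compareTo(toSortableHex(epoch2)) < 0);
  }
}
```

Long.compareUnsigned (available since Java 8) avoids a hand-written
encoding, but every comparison site would have to use it consistently.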

> One thing I just want to say, we recommend HA or ratis enabled.

Thank you for the advice. Our cluster runs 1.1.0, but we explicitly
disabled Ratis when upgrading from 1.0 to 1.1, so I guess it's still
safe. Enabling Ratis after upgrading to 1.2 would probably be safe
with respect to the object ID issue in 1.1, if I understand correctly.

Thanks, and sorry for the late reply,
Kota




-- 
--
Kota UENISHI, Engineer



Re: Design doc to fix HDDS-5905

Posted by Bharat Viswanadham <bv...@cloudera.com.INVALID>.
Hi Kota,

>My question is that, is transaction index always available for non-HA
>cluster?

Yes, the transaction index is available for non-HA as well. But when you
move from non-HA to HA, the transaction index starts again from 0, because
it is a newly set up cluster and the Ratis transaction index starts from 0
again. To avoid object IDs colliding, we generate a unique object ID based
on the transaction ID, and we also persist the transaction ID and resume
from it after restarts (HDDS-4315). Maybe we can use the object ID here as
well, to avoid collisions in an upgrade from non-HA to HA.

*Example scenario* where using the transaction index might cause a problem
(this is a very theoretical example):
Let's say transaction ID 100 deletes key1 before the upgrade.
Now transaction ID 100 deletes key1 again after the upgrade; we might miss
block cleanup (like the scenario described in HDDS-5905).

Considering the above, I think using the transaction ID might be an issue
across such an upgrade; otherwise, for HA/Ratis-enabled deployments,
including single-node ones, using the transaction ID should be fine.
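
A rough Java sketch of the scenario above, for illustration only. It
assumes a delete table keyed by transaction index plus key name; that key
format, the epoch-based variant, and all names below are hypothetical and
are not Ozone's actual layout.

```java
import java.util.TreeMap;

/** Illustration only: the key formats are hypothetical, not Ozone's delete table layout. */
final class TxIndexReuse {
  /** Key built from the transaction index alone. */
  static String byTxIndex(long txIndex, String keyName) {
    return txIndex + "/" + keyName;
  }

  /** Key that also carries an epoch which changes whenever the index restarts from 0. */
  static String byEpochAndTxIndex(int epoch, long txIndex, String keyName) {
    return epoch + "/" + txIndex + "/" + keyName;
  }

  public static void main(String[] args) {
    TreeMap<String, String> table = new TreeMap<>();
    table.put(byTxIndex(100, "key1"), "blocks-before-upgrade");
    // After the upgrade the index restarts, so 100 is handed out again
    // and the second put replaces the first entry.
    String lost = table.put(byTxIndex(100, "key1"), "blocks-after-upgrade");
    System.out.println("overwritten entry: " + lost);

    TreeMap<String, String> withEpoch = new TreeMap<>();
    withEpoch.put(byEpochAndTxIndex(0, 100, "key1"), "blocks-before-upgrade");
    withEpoch.put(byEpochAndTxIndex(1, 100, "key1"), "blocks-after-upgrade");
    System.out.println("entries kept: " + withEpoch.size());
  }
}
```

In the first table the earlier entry is replaced, so its blocks would never
be cleaned up; an ID that stays unique across the upgrade avoids that.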


One thing I just want to say: we recommend enabling HA or Ratis. (Before
HDDS-4315, transaction IDs were generated from 0 again after a restart,
so object IDs might not be unique in the cluster. Also, Ratis has been
enabled by default since the 1.1.0 release,
HDDS-4498 <https://issues.apache.org/jira/browse/HDDS-4498>.)



Thanks,
Bharat


