Posted to dev@ozone.apache.org by Kota Uenishi <ko...@preferred.jp> on 2022/01/28 07:50:18 UTC

Update and design decision on HDDS-5905

Hi Ozone dev,

I proposed a fix for HDDS-5905 a while ago. Our cluster has become
stable after some work, so I now have time to resume it. As I learned
more about Ozone internals, I again faced a design decision on the key
format.

Bharat once advised me [1] to use object IDs instead of the
transaction index (and instead of timestamps), to handle restarts and
cluster upgrades to Ratis. But that has a drawback on object
overwrite, so I came up with another design choice. The two options are:

1. Use object IDs as the key in the delete table
Pros: object IDs are used consistently in OM and are easy to pick up in
a RocksDB batch.
Cons:
 - When an object is overwritten, the object ID of the key is not
   updated, while the previous blocks of the overwritten key become
   eligible for deletion (see HDDS-5461 and HDDS-5656). Under this
   condition there is a race where blocks get lost and will never be
   collected. An example scenario:

key open  oid=1
key commit
key open (overwrite) oid=1’  #<= oid must be updated on overwrite, or use update id
key delete oid=1
key commit
key delete oid=1’ (<= overwritten and previous block gets leaked)
deletion service deletes 1’

   This behavior should be changed to assign a new oid=2 on overwrite.
 - Even with this fix, blocks are deleted in the order of key open,
   not in the order of key deletion. That's better than alphabetical
   order, but not perfect.
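
The collision in option 1 can be illustrated with a toy model. This is
a sketch, not Ozone code: the class DeleteTableSketch, its put method,
and the block names are hypothetical, and a TreeMap stands in for the
RocksDB delete table (one value per key, iterated in key order). It
assumes the leak comes from two deleted versions of a key landing on
the same delete-table key, which is one reading of the scenario above.

```java
import java.util.List;
import java.util.TreeMap;

/** Toy model of a delete table keyed by an ID. Hypothetical, not Ozone code. */
final class DeleteTableSketch {
    // Sorted map standing in for a RocksDB table: one value per key.
    private final TreeMap<Long, List<String>> table = new TreeMap<>();

    /**
     * Queues blocks for deletion under the given ID.
     * Returns the blocks that were silently replaced, i.e. the ones
     * the deletion service would never see.
     */
    List<String> put(long id, List<String> blocks) {
        List<String> previous = table.put(id, blocks);
        return previous == null ? List.of() : previous;
    }

    /** Number of entries the deletion service would still process. */
    int pendingEntries() {
        return table.size();
    }

    public static void main(String[] args) {
        // Overwrite keeps oid=1, so both versions are queued under key 1.
        DeleteTableSketch sharedId = new DeleteTableSketch();
        sharedId.put(1L, List.of("block-A"));
        List<String> displaced = sharedId.put(1L, List.of("block-B"));
        System.out.println("displaced: " + displaced); // prints "displaced: [block-A]"

        // Overwrite gets a new oid=2, so both entries survive.
        DeleteTableSketch freshId = new DeleteTableSketch();
        freshId.put(1L, List.of("block-A"));
        freshId.put(2L, List.of("block-B"));
        System.out.println("pending: " + freshId.pendingEntries()); // prints "pending: 2"
    }
}
```

With a shared ID the first entry is replaced and its blocks are never
handed to the deletion service; with a new ID per overwrite both
entries remain.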

2. Use update IDs as the key in the delete table
Pros: The design is cleaner and the order of block deletion will be correct.
Cons:
 - Currently, the assignment of update IDs is somewhat inconsistent. In most
   places the raw transaction index is used, but in some places the object ID
   is used as-is, e.g. directory creation (see OMDirectoryCreateRequest.java).
 - A fix would be to prefix update IDs with the epoch number, in the same way
   as object IDs, but most of the places that set update IDs would have to change.
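
To make the epoch prefix idea concrete, here is a minimal sketch. The
bit widths, the class EpochPrefixedId, and the methods compose and
toTableKey are all hypothetical and are not Ozone's actual layout or
API. It assumes the raw transaction index can start again from a lower
value (for example after a restart or an upgrade to Ratis) and that the
table orders keys bytewise, so a fixed-width hex string keeps the
numeric order.

```java
/** Hypothetical epoch-prefixed ID layout, for illustration only. */
final class EpochPrefixedId {
    static final int INDEX_BITS = 47; // assumed width of the transaction index
    static final int EPOCH_BITS = 16; // assumed width of the epoch; the sign bit stays 0
    static final long MAX_INDEX = (1L << INDEX_BITS) - 1;
    static final long MAX_EPOCH = (1L << EPOCH_BITS) - 1;

    /** Packs the epoch into the high bits and the transaction index into the low bits. */
    static long compose(long epoch, long txIndex) {
        if (epoch < 0 || epoch > MAX_EPOCH) {
            throw new IllegalArgumentException("epoch out of range: " + epoch);
        }
        if (txIndex < 0 || txIndex > MAX_INDEX) {
            throw new IllegalArgumentException("txIndex out of range: " + txIndex);
        }
        return (epoch << INDEX_BITS) | txIndex;
    }

    /** Zero-padded hex, so lexicographic order matches numeric order for these non-negative IDs. */
    static String toTableKey(long id) {
        return String.format("%016x", id);
    }

    public static void main(String[] args) {
        long beforeRestart = compose(0, 1000); // large index in the old epoch
        long afterRestart = compose(1, 5);     // small index in the new epoch
        System.out.println(beforeRestart < afterRestart); // prints "true"
        System.out.println(
            toTableKey(beforeRestart).compareTo(toTableKey(afterRestart)) < 0); // prints "true"
    }
}
```

An ID assigned in a later epoch always sorts after one from an earlier
epoch, whatever the raw transaction index is, which is the property the
delete table needs to keep blocks in deletion order.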

I feel option 1 is easier but not quite correct, while option 2 is more
correct but requires a wide change. I updated my proposal accordingly
[2], so please let me know your thoughts on which to choose. Also, my
messy working branch can be found here [3].

P.S. My fix for HDDS-5905 conflicts with and depends on HDDS-5656,
because that is also about key deletion and overwrite. I'd like to get
HDDS-5656 reviewed and merged first. It's a leftover task from
HDDS-5461 and should be merged for 1.3.

[1] https://lists.apache.org/thread/79qgx598rv3qcojmzoxhc9ypkh1jj64y
[2] https://docs.google.com/document/d/1KeyhiE1i5SqRSgLy-pIOGW9X6mUYb8iYEkEoDAEQD9Q/edit#heading=h.nqxuhw78zsv7
[3] https://github.com/kuenishi/ozone/pull/1

-- 
--
Kota UENISHI, Engineer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@ozone.apache.org
For additional commands, e-mail: dev-help@ozone.apache.org


Re: Update and design decision on HDDS-5905

Posted by Kota Uenishi <ko...@preferred.jp>.
Hi Prashant,

Thank you for the invitation. I'll join the community meeting (the
Friday sync). Is [1] the right place to find the details for joining?
Also, I'm already in the #ozone channel on the Apache Slack, so feel
free to ask me any questions there.

[1] https://cwiki.apache.org/confluence/display/OZONE/Ozone+Community+Calls

On Tue, Feb 1, 2022 at 1:44 AM Prashant Pogde
<pp...@cloudera.com.invalid> wrote:
>
> Hi Kota,
>
> I went through your proposal and it looks good.
> Let us discuss this in our next ozone community meeting as well. Let us connect on apache slack.
>
> Regards,
> Prashant


-- 
--
Kota UENISHI, Engineer



Re: Update and design decision on HDDS-5905

Posted by Prashant Pogde <pp...@cloudera.com.INVALID>.
Hi Kota,

I went through your proposal and it looks good.
Let us discuss this in our next ozone community meeting as well. Let us connect on apache slack.

Regards,
Prashant

