Posted to dev@ozone.apache.org by "Elek, Marton" <el...@apache.org> on 2020/08/25 12:22:44 UTC

Re: Ozone non-rolling upgrades

Bumping this thread.

If you have any opinion, please let me know.

Thanks a lot,
Marton




On 6/26/20 2:51 PM, Elek, Marton wrote:
> 
> Thank you very much for working on this, Aravindan.
> 
> Finally, I collected my thoughts about the proposal.
> 
> First of all, I really like the concept in general, and I like the style 
> of the documentation. It clearly explains a lot of existing behavior of 
> Ozone to make it easier to understand the problems.
> 
> I like the abstraction of Software Layout Version vs. Metadata 
> Layout Version.
> 
> I have some comments, but most of them are about technical details (not 
> about the concept itself). And they are questions and ideas, not strong 
> opinions.
> 
> 1. On-line upgrade vs offline-upgrade
> 
> There is an option to do the upgrade offline: instead of calling an RPC, 
> execute a CLI command.
> 
> a) for an online upgrade we need to introduce a very specific running mode 
> in which nobody can use the cluster (or only in read-only mode?) 
> until the server is "finalized"
> 
> b) a CLI can do any migration and upgrade the MLV inside the database. The 
> only question is the old / persisted data in the raft log, but IMHO it 
> shouldn't be a problem:
> 
>   1. we should commit the MLV upgrade with a raft transaction anyway
>   2. ratis log entries are like client calls, and we are supposed to be 
> backward compatible with old clients
> 
> I am not sure if the CLI approach is better (it seems simpler to me), 
> but at least we can compare the two approaches and explain why we 
> prefer the RPC-based method (if that is the better one).
> 
> 2. I had an interesting conversation about why HDFS clusters are not 
> upgraded to Hadoop 3 and got some thoughts.
> 
> This document proposes to always use the same version for SCM and 
> datanodes, which keeps it simple.
> 
> I agree that it simplifies our job, but I think it can make the upgrade 
> harder, especially for a 1-2000 node cluster.
> 
> After the storage-class proposal I have a different mental model:
> 
>   I think there can be different types of containers with different 
> replication strategies. Containers are classified with a storage-class, and 
> the storage-class defines the container replication type.
> 
> In this model it's very easy to imagine that different datanodes can 
> support different replication types (or replication versions).
> 
> Let's say I have 1000 nodes and I upgrade 500 of them to a specific 
> datanode version which can support EC containers. SCM can easily manage 
> this if it's already prepared to support different types of 
> containers / replications (which is our goal, IMHO) based on node 
> capabilities.
> 
> In this model it should be easy to enable independent upgrades of 
> datanodes, which can make it much easier to upgrade a big cluster 
> (though I agree with requiring OM/SCM/RECON upgrades at the same time).
> 
> 
> What do you think about this?
> 
> 
> 3. Finalize
> 
> Personally I don't like the "finalize" word. It suggests that we have an 
> upgrade process which can be "finalized", but in fact we don't have such a 
> process. We only start doing the work AFTER the finalize button is pushed.
> 
> I know that it comes from the HDFS history, but I would prefer to use 
> more generic and expressive words. (For example: jar/binary upgrade vs. 
> metadata upgrade.)
> 
> In the end I learned what finalize means (thanks to your patient 
> explanation during an offline conversation ;-) ), but we can make the 
> understanding easier for future users.
> 
> 4. During your presentation you talked about downgrade/rollback. I 
> felt that there could be a lot of tricky corner cases related to Ratis + 
> snapshots. As a concept I like it (but my 2nd point is more important for 
> me, if possible), but I think we will see tricky technical problems at 
> the code level.
> 
> 
> Thanks again for the great work,
> Marton
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: ozone-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: ozone-dev-help@hadoop.apache.org
> 



Re: Ozone non-rolling upgrades

Posted by "Elek, Marton" <el...@apache.org>.
Thanks

On 8/25/20 7:05 PM, Aravindan Vijayan wrote:
> Hi Marton,
> 
> Thanks for the questions. Answers below.
> 
> *On-line upgrade vs offline-upgrade*
> The "Pre-Finalized" state is not meant to be a Read only state in Ozone.
> All existing Read/Write APIs will be allowed since they are guaranteed to
> be backward compatible. The only APIs that will not be allowed before
> finalization are those that are new or those that caused a layout change.
> For example, create EC file, Truncate etc. Hence, this is not really an
> "online" upgrade.
>

Yes, I understand the Pre-Finalize state (MLV != SLV), but there is a 
process which starts when the user is happy with the current cluster 
(upgrading the MLV). I think this is what is called "finalize".

This is supposed to be a short process, and there are two options:

1. Do it while the cluster is running

PRO:
  * You can read from the cluster during the upgrade
  * Easier to check the versions of multiple services
  * Easier to integrate with cluster managers (argument from your answer)
  * Can support a longer upgrade process if required (the cluster can be 
available during the upgrade process)

CON:
  * More complex, as we need logic to disable some of the write methods
  * More complex, as it requires implementing new RPC servers
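The "disable some of the write methods" logic is conceptually small. A minimal sketch, with entirely made-up names (this is not the actual Ozone API): each API carries the layout version that introduced it, and a request is refused while the cluster's MLV is still below that version:

```java
// Hypothetical sketch of pre-finalize API gating: an operation introduced by
// layout version V is rejected until the Metadata Layout Version reaches V.
public class PreFinalizeGateSketch {
    private final int mlv;  // current Metadata Layout Version of the cluster

    public PreFinalizeGateSketch(int mlv) {
        this.mlv = mlv;
    }

    /** Pre-existing APIs carry an older (or equal) version and always pass;
     *  APIs added by a newer layout are refused until finalize bumps the MLV. */
    public boolean isAllowed(int requiredLayoutVersion) {
        return requiredLayoutVersion <= mlv;
    }
}
```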

2. Do it when the cluster is stopped

  PRO:
    * Very easy to implement: nothing more than updating state on disk.
    * Easier to keep data consistent, as a full restart will happen 
after the upgrade

  CON:
    * Full service outage
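To illustrate how small option 2 could be, here is a sketch (all names are invented; this is not the real Ozone code): with the cluster stopped, the CLI reads the persisted MLV, runs the pending per-version migrations in order, and persists the new MLV:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an offline "finalize" CLI. None of these names come
// from Ozone; the point is only that, with the cluster stopped, finalizing is
// little more than "run pending migrations, then persist the new MLV".
public class OfflineFinalizeSketch {
    // One migration per layout version, applied in ascending order.
    private final TreeMap<Integer, Runnable> migrations = new TreeMap<>();
    private int mlv;        // stands in for the MLV persisted in the database
    private final int slv;  // layout version compiled into the software

    public OfflineFinalizeSketch(int persistedMlv, int softwareSlv) {
        this.mlv = persistedMlv;
        this.slv = softwareSlv;
    }

    public void register(int layoutVersion, Runnable migration) {
        migrations.put(layoutVersion, migration);
    }

    /** Runs every migration with a version in (mlv, slv], committing the new
     *  MLV after each step so a crash leaves a consistent on-disk state. */
    public int finalizeLayout() {
        for (Map.Entry<Integer, Runnable> e
                : migrations.subMap(mlv, false, slv, true).entrySet()) {
            e.getValue().run();  // e.g. rewrite an on-disk structure
            mlv = e.getKey();    // persist progress (here: just the field)
        }
        return mlv;
    }
}
```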



I think the simplicity of the second option is very tempting, and I 
was interested in your opinion about the PROs/CONs of the different 
options, to understand the reasons why we chose the 1st.

(I got the answer, and they are fair arguments. I tried to add all of 
the mentioned ones above, but if you have more, please add them.)


> *Enable independent upgrade of datanodes which can make it way more easier
> to upgrade a big cluster.*

>  From the examples you have mentioned, I do see some advantages to
> supporting separate datanode upgrades. The logic we went with now is meant
> to be restrictive since it is a full non-rolling upgrade (master +
> workers). 

Thanks for explaining it. It's good to know that it's something we can 
support in the (very) long term.


> 
> *Finalize*
> As mentioned earlier, the Pre-Finalized state is not a complete standstill
> state for Ozone. Only new features/APIs/layout changes will be disallowed
> until the user decides to Finalize.

Yeah, I got it. That is exactly the reason why I don't like the "finalize" 
word.

Because we have 3 states:

1. cluster is running with new SLV and old MLV (long time)
2. upgrading the MLV (short time)
3. upgraded cluster is running


When you ask to finalize, you ask to upgrade the MLV, and in fact we don't 
finish any task but start a new one (the upgrade).
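In other words, the state can be derived from the (MLV, SLV) pair alone. A toy sketch with my own naming (not Ozone's), where state 2 is the short metadata-upgrade that today is triggered by the "finalize" command:

```java
// Toy classification of the cluster state by the (MLV, SLV) pair alone.
// Names are illustrative only, not taken from the Ozone code base.
public class UpgradeStateSketch {
    public enum State { PRE_FINALIZED, FINALIZED }

    public static State classify(int mlv, int slv) {
        if (mlv > slv) {
            // Running older software against newer metadata is not allowed.
            throw new IllegalStateException("software older than metadata layout");
        }
        return mlv < slv ? State.PRE_FINALIZED : State.FINALIZED;
    }
}
```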

But I can accept it; I just expressed my concern that for me (a non-native 
speaker) 'finalize' is not a natural word and doesn't express the 
technical process very well.

I would call it -- for example -- metadata-upgrade.

And

ozone upgrade := software-upgrade + metadata-upgrade

and the two upgrades can be separated from each other.

(If I understood correctly, today we call the START of the metadata-upgrade 
FINALIZE.)

But I can live with it; I just explained why it is strange for me.


Thanks again for your answers; it's good to see it moving forward.

Marton




Re: Ozone non-rolling upgrades

Posted by "Lin, Yiqun" <yi...@ebay.com.INVALID>.
Thanks Vijayan. 
I will take a look at the above-mentioned PR; that's a good start for this :).

Thanks,
Yiqun

On 2020/8/27, 1:03 AM, "Aravindan Vijayan" <av...@cloudera.com.INVALID> wrote:

    External Email

    Hi Yiqun,

    Thanks for the question.

    For the first implementation, we will go with a smaller goal list first.
    > Track just MLV (Layout Version as seen in metadata) and SLV (Layout
    Version from Software). All layout changes (including apply transaction
    changes) will bump up the layout version.
    > There is only 1 layout version hierarchy for the HDDS layer (SCM + DN).
    This allows us to keep track of only OM-SCM LV compatibility for now.

    Once this is done, we can introduce the following one by one.
    > Concept of a software version (SV) which can be explicitly used to handle
    Raft log layout changes (apply transaction changes). This allows downgrades
    when there is ONLY an SV change, and no LV change.
    > Separate version hierarchy for SCM and DN to allow DNs of different
    layout versions to co-exist in the cluster.

    Please follow the commits in branch HDDS-3698-upgrade, where we are working
    on building the framework layer for upgrades. Some of this will be clearer
    as more PRs come out. A good place to start is
    https://github.com/apache/hadoop-ozone/pull/1322/
    <https://github.com/apache/hadoop-ozone/pull/1322/files>.

    On Wed, Aug 26, 2020 at 5:58 AM Lin, Yiqun <yi...@ebay.com.invalid> wrote:

    > Hi Aravindan Vijayan,
    >
    > I have one question for the metadata version concept here.
    > I noticed that we introduced the Metadata Layout Version (MLV) in the
    > latest design doc of this feature in HDDS-3698.
    >
    > >The Metadata Layout Version (MLV) of an ozone component represents
    > the layout version of its on-disk structures.
    > ...Each SV has a corresponding MLV that it is compatible with.
    > As we know, Ozone contains many components (SCM, OM, Datanode for now),
    > and there are some potential dependencies between these components. For
    > example, a new DN MLV may rely on a new SCM MLV being finalized and in
    > use. So how do we plan to deal with this type of case with the current
    > <SV, compatible MLV> approach? Checking only the SV and its corresponding
    > component-compatible MLV does not seem to be enough.
    >
    > Thanks,
    > Yiqun
    >
    > On 2020/8/26, 1:05 AM, "Aravindan Vijayan" <av...@cloudera.com.INVALID>
    > wrote:
    >
    >
    >     Hi Marton,
    >
    >     Thanks for the questions. Answers below.
    >
    >     *On-line upgrade vs offline-upgrade*
    >     The "Pre-Finalized" state is not meant to be a Read only state in
    > Ozone.
    >     All existing Read/Write APIs will be allowed since they are guaranteed
    > to
    >     be backward compatible. The only APIs that will not be allowed before
    >     finalization are those that are new or those that caused a layout
    > change.
    >     For example, create EC file, Truncate etc. Hence, this is not really an
    >     "online" upgrade.
    >
    >
    >     *Enable independent upgrade of datanodes which can make it way more
    > easier
    >     to upgrade a big cluster.*
    >     From the examples you have mentioned, I do see some advantages to
    >     supporting separate datanode upgrades. The logic we went with now is
    > meant
    >     to be restrictive since it is a full non-rolling upgrade (master +
    >     workers). However, keeping rolling upgrades in mind, we will implement
    > it
    >     in such a way that it can easily support the use case you mention in
    > the
    >     future. Instead of keeping 1 HDDS version, we can fork off the Datanode
    >     layout version separately, and maintain a code level compatibility
    > matrix
    >     between SCM and Datanodes in the future. That way, SCM can support
    >     Datanodes of multiple layout versions together, with the only
    > restriction
    >     that an active pipeline (Ratis/EC) can be created only with those of
    > the
    >     same layout version.
    >
    >
    >     *Finalize*
    >     As mentioned earlier, the Pre-Finalized state is not a complete
    > standstill
    >     state for Ozone. Only new features/APIs/layout changes will be
    > disallowed
    >     until the user decides to Finalize. This state will serve as an
    > "insurance"
    >     for the user (and the Ozone team) to allow downgrade to an older
    > version if
    >     basic compat is broken or there is a serious regression. The name
    >     "finalize" has been borrowed from HDFS world. IMHO, it is a more
    > intuitive
    >     user experience to have a CLI driven (in the case of a CM managed
    > cluster,
    >     it will be a clickable UI option) rather than the user restarting the
    >     cluster again with a specific config change (which is an Ozone internal
    >     detail) for layout update.
    >
    >     *During your presentation you talked about the downgrade/rollback. I
    > felt
    >     that there could be a lot of tricky corner cases related to ratis +
    >     snapshot. *
    >     *As a concept I like it (but my 2nd point is more important for me, if
    >     possible), but I think we will see tricky technical problems on the
    > code
    >     level.*
    >     Yes, with respect to Ratis, it will be a challenge to guarantee that
    > the
    >     same "version" of the code "applies the transaction" on all the 3 nodes
    >     during the upgrade. By doing the following, we can approach the problem
    >     > Handling Ratis request handling changes as layout changes
    >     > Tagging every Ratis request with the current layout version
    >     > Introducing a "factory" in the Ratis request handler which looks at
    > the
    >     version of the request from the log, and then supplies the correct
    >     implementation for that request.
    >     In the future, there is also a plan to move the handling of Ratis
    > request
    >     versioning to a separate version hierarchy than MLV/SLV. I will be
    > adding
    >     more details on the v2.0 doc that will be uploaded later this week to
    >     HDDS-3698.
    >

    -- 
    Thanks & Regards,
    Aravindan
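The "factory" idea from the quoted answer (choose the request handler based on the layout version tagged on each Ratis log entry) could be sketched roughly as follows; all names here are made up for illustration and are not the actual Ozone classes:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative sketch of a version-aware request handler factory: each Ratis
// log entry is tagged with the layout version that wrote it, and replay picks
// the newest handler registered at or below that version.
public class RequestHandlerFactorySketch {
    private final TreeMap<Integer, Function<String, String>> handlers = new TreeMap<>();

    public void register(int layoutVersion, Function<String, String> handler) {
        handlers.put(layoutVersion, handler);
    }

    /** Applies the handler matching the version tagged on the log entry, so
     *  all replicas replay an old entry with the same (old) semantics. */
    public String apply(int taggedLayoutVersion, String request) {
        Map.Entry<Integer, Function<String, String>> e =
            handlers.floorEntry(taggedLayoutVersion);
        if (e == null) {
            throw new IllegalArgumentException(
                "no handler for layout version " + taggedLayoutVersion);
        }
        return e.getValue().apply(request);
    }
}
```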


Re: Ozone non-rolling upgrades

Posted by Aravindan Vijayan <av...@cloudera.com.INVALID>.
Hi Yiqun,

Thanks for the question.

For the first implementation, we will go with a smaller goal list first.
> Track just MLV (Layout Version as seen in metadata) and SLV (Layout
Version from Software). All layout changes (including apply transaction
changes) will bump up the layout version.
> There is only 1 layout version hierarchy for the HDDS layer (SCM + DN).
This allows us to keep track of only OM-SCM LV compatibility for now.

Once this is done, we can introduce the following one by one.
> Concept of a software version (SV) which can be explicitly used to handle
Raft log layout changes (apply transaction changes). This allows downgrades
when there is ONLY an SV change, and no LV change.
> Separate version hierarchy for SCM and DN to allow DNs of different
layout versions to co-exist in the cluster.
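The "bump up the layout version" rule above can be pictured as each feature declaring the layout version it needs, with the SLV simply being the maximum of those. A sketch with invented feature names (not the real Ozone list):

```java
import java.util.Arrays;

// Illustrative only: features declare the layout version they require, the
// Software Layout Version (SLV) is the highest one compiled in, and a feature
// stays disabled until the on-disk MLV has caught up via finalize.
public class LayoutVersionSketch {
    public enum LayoutFeature {
        INITIAL(0), NEW_APPLY_TXN(1), ERASURE_CODING(2);  // made-up examples
        final int layoutVersion;
        LayoutFeature(int v) { this.layoutVersion = v; }
    }

    /** The SLV is the highest layout version any compiled-in feature needs. */
    public static int softwareLayoutVersion() {
        return Arrays.stream(LayoutFeature.values())
            .mapToInt(f -> f.layoutVersion).max().orElse(0);
    }

    /** A feature is usable only once the on-disk MLV has caught up. */
    public static boolean isAllowed(LayoutFeature f, int mlv) {
        return f.layoutVersion <= mlv;
    }
}
```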

Please follow the commits in branch HDDS-3698-upgrade, where we are working
on building the framework layer for upgrades. Some of this will be clearer
as more PRs come out. A good place to start is
https://github.com/apache/hadoop-ozone/pull/1322/
<https://github.com/apache/hadoop-ozone/pull/1322/files>.

On Wed, Aug 26, 2020 at 5:58 AM Lin, Yiqun <yi...@ebay.com.invalid> wrote:

> Hi Aravindan Vijayan,
>
> I have one question for the metadata version concept here.
> I noticied that we introduce the Metadata Layout Version​ (​MLV​) in the
> latest design doc of this feature in HDDS-3698.
>
> >The ​Metadata Layout Version​ (​MLV​) of an ozone component represents
> the layout version of its on disk structures.
> ...Each ​SV ​has a corresponding MLV that it is compatible with.
> As we know Ozone contains many components(SCM, OM, Datanode now), and it
> exists some potential dependency of these component. For example, one new
> DN MLV version relies one new SCM MLV be finalized and used. So how do we
> plan to deal with this type case by using current <SV, compatible MLV> way.
> Only check SV and its corresponding component compatible MLV seems not
> enough.
>
> Thanks,
> Yiqun
>
> On 2020/8/26, 1:05 AM, "Aravindan Vijayan" <av...@cloudera.com.INVALID>
> wrote:
>
>     External Email
>
>     Hi Marton,
>
>     Thanks for the questions. Answers below.
>
>     *On-line upgrade vs offline-upgrade*
>     The "Pre-Finalized" state is not meant to be a Read only state in
> Ozone.
>     All existing Read/Write APIs will be allowed since they are guaranteed
> to
>     be backward compatible. The only APIs that will not be allowed before
>     finalization are those that are new or those that caused a layout
> change.
>     For example, create EC file, Truncate etc. Hence, this is not really an
>     "online" upgrade.
>
>
>     *Enable independent upgrade of datanodes which can make it way more
> easier
>     to upgrade a big cluster.*
>     From the examples you have mentioned, I do see some advantages to
>     supporting separate datanode upgrades. The logic we went with now is
> meant
>     to be restrictive since it is a full non-rolling upgrade (master +
>     workers). However, keeping rolling upgrades in mind, we will implement
> it
>     in such a way that it can easily support the use case you mention in
> the
>     future. Instead of keeping 1 HDDS version, we can fork off the Datanode
>     layout version separately, and maintain a code level compatibility
> matrix
>     between SCM and Datanodes in the future. That way, SCM can support
>     Datanodes of multiple layout versions together, with the only
> restriction
>     that an active pipeline (Ratis/EC) can be created only with those of
> the
>     same layout version.
>
>
>     *Finalize*
>     As mentioned earlier, the Pre-Finalized state is not a complete
> standstill
>     state for Ozone. Only new features/APIs/layout changes will be
> disallowed
>     until the user decides to Finalize. This state will serve as an
> "insurance"
>     for the user (and the Ozone team) to allow downgrade to an older
> version if
>     basic compat is broken or there is a serious regression. The name
>     "finalize" has been borrowed from HDFS world. IMHO, it is a more
> intuitive
>     user experience to have a CLI driven (in the case of a CM managed
> cluster,
>     it will be a clickable UI option) rather than the user restarting the
>     cluster again with a specific config change (which is an Ozone internal
>     detail) for layout update.
>
>     *During your presentation you talked about the downgrade/rollback. I
> felt
>     that there could be a lot of tricky corner cases related to ratis +
>     snapshot. *
>     *As a concept I like it (but my 2nd point is more important for me, if
>     possible), but I think we will see tricky technical problems on the
> code
>     level.*
>     Yes, with respect to Ratis, it will be a challenge to guarantee that
> the
>     same "version" of the code "applies the transaction" on all the 3 nodes
>     during the upgrade. By doing the following, we can approach the problem
>     > Handling Ratis request handling changes as layout changes
>     > Tagging every Ratis request with the current layout version
>     > Introducing a "factory" in the Ratis request handler which looks at
> the
>     version of the request from the log, and then supplies the correct
>     implementation for that request.
>     In the future, there is also a plan to move the handling of Ratis
> request
>     versioning to a separate version hierarchy than MLV/SLV. I will be
> adding
>     more details on the v2.0 doc that will be uploaded later this week to
>     HDDS-3698.
>
>     On Tue, Aug 25, 2020 at 5:22 AM Elek, Marton <el...@apache.org> wrote:
>
>     >
>     > Bumping this thread.
>     >
>     > If you have any opinion, please let me know.
>     >
>     > Thanks a lot,
>     > Marton
>     >
>     >
>     >
>     >
>     > On 6/26/20 2:51 PM, Elek, Marton wrote:
>     > >
>     > > Thanks you very much to work on this Aravindan.
>     > >
>     > > Finally, I collected my thoughts about the proposal.
>     > >
>     > > First of or, I really like the concept in general, and I like the
> style
>     > > the documentation. It clearly explains a lot of existing behavior
> of
>     > > Ozone to make it easier to understand the problems.
>     > >
>     > > I like the the abstraction of Software Layout Version vs. Metadata
>     > > Layout Version
>     > >
>     > > I have some comments, but most of them are about technical details
> (not
>     > > about the concept itself). And they are questions and ideas not
> strong
>     > > opinions.
>     > >
>     > > 1. On-line upgrade vs offline-upgrade
>     > >
>     > > There is an option to do the upgrade offline: instead of calling
> an RPC,
>     > > executing a CLI.
>     > >
>     > > a) for online upgrade we need to introduce a very specific running
> mode
>     > > which means that nobody can use the cluster (or just in read only
> mode?)
>     > > until the server is "finalized"
>     > >
>     > > b) CLI can do any migration and upgrade the MLV inside database.
> The
>     > > only question is the old / peristed data in raft log, but IMHO it
>     > > shouldn't be a problem:
>     > >
>     > >   1. we should commit the MLV upgrade with a raft transaction
> anyway
>     > >   2. ratis log entries like client calls, and we supposed to be
> backward
>     > > compatible with old clients
>     > >
>     > > I am not sure if the CLI approach is better (it seems to be more
> simple
>     > > for me) but at least we can compare the two approaches and explain
> why
>     > > do we prefer the RPC based method (if that is the better)
>     > >
>     > > 2. I had an interesting conversation about why HDFS clusters are
> not
>     > > upgraded to Hadoop 3 and got some thoughts.
>     > >
>     > > This document propose to always use the same version from SCM and
>     > > datanode which makes it simple.
>     > >
>     > > I agree that it simplifies our job, but I think It can make the
> upgrade
>     > > harder. Especially for a 1-2000 node cluster.
>     > >
>     > > After the storage-class proposal I have a different mental model:
>     > >
>     > >   I think there can be different type of containers with different
>     > > replication strategies. Containers are classified with
> storage-class and
>     > > storage-class defines the container replication type.
>     > >
>     > > In this model it's very easy to imagine that different datanodes
> can
>     > > support different replication type (or replication version).
>     > >
>     > > Let's say I have 1000 nodes and I upgrade 500 of them to a specific
>     > > datanode version which can support EC container. SCM can easily
> manage
>     > > this problem if it's already prepared to support different type of
>     > > containers / replications (which is our goal, IMHO) based on node
>     > > capabilities.
>     > >
>     > > In this model it should be easy to enable independent upgrade of
>     > > datanodes which can make it way more easier to upgrade a big
> cluster.
>     > > (but I agree to require OM/SCM/RECON upgrade at the same time)
>     > >
>     > >
>     > > What do you think about this?
>     > >
>     > >
>     > > 3. Finalize
>     > >
>     > > Personally I don't like the "finalize" word. It suggests that we
> have an
>     > > upgrade process which can be "finalized", but in fact we don't
> have such
>     > > process. We start do any work AFTER the finalize button is pushed.
>     > >
>     > > I know that it comes from the HDFS history, but I would prefer to
> use a
>     > > more generic and expressive words. (For example: jar/binary
> upgrade vs.
>     > > metadata upgrade).
>     > >
>     > > At the end I learned what finally means (thanks to your patient
>     > > explanation during offline conversation ;-) ), but we can make the
>     > > understanding easier for next users.
>     > >
>     > > 4. During you presentation you talked about the
> downgrade/rollback. I
>     > > felt that there could be a lot of tricky corner cases related to
> ratis +
>     > > snapshot. As a concept I like it (but my 2nd point is more
> important for
>     > > me, if possible), but I think we will see tricky technical
> problems on
>     > > the code level.
>     > >
>     > >
>     > > Thanks again the great work,
>     > > Marton
>     > >
>     > >
> ---------------------------------------------------------------------
>     > > To unsubscribe, e-mail: ozone-dev-unsubscribe@hadoop.apache.org
>     > > For additional commands, e-mail: ozone-dev-help@hadoop.apache.org
>     > >
>     >
>     >
>     >
>
>     --
>     Thanks & Regards,
>     Aravindan
>
>
>
>

-- 
Thanks & Regards,
Aravindan

Re: Ozone non-rolling upgrades

Posted by "Lin, Yiqun" <yi...@ebay.com.INVALID>.
Hi Aravindan Vijayan,

I have one question about the metadata version concept here.
I noticed that we introduce the Metadata Layout Version (MLV) in the latest design doc of this feature in HDDS-3698.

>The Metadata Layout Version (MLV) of an ozone component represents the layout version of its on disk structures. 
...Each SV has a corresponding MLV that it is compatible with.
As we know, Ozone contains many components (SCM, OM, Datanode for now), and there are potential dependencies between these components. For example, a new DN MLV may rely on a new SCM MLV being finalized and in use. So how do we plan to handle this type of case with the current <SV, compatible MLV> approach? Checking only the SV and its corresponding component's compatible MLV seems insufficient.
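To make the dependency concern concrete, here is a minimal sketch of one way a cross-component check could work, in addition to the per-component <SV, MLV> check. All class and method names here are hypothetical, not from the Ozone codebase:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a cross-component rule says a DN MLV may only be
// finalized once the SCM MLV it depends on has already been finalized.
public class CrossComponentMlvCheck {

  // component name -> currently finalized MLV
  private final Map<String, Integer> finalizedMlv;

  // each rule: component X at MLV m requires component Y at MLV >= n
  record Dependency(String component, int mlv, String requires, int requiredMlv) {}

  private final List<Dependency> rules;

  public CrossComponentMlvCheck(Map<String, Integer> finalizedMlv, List<Dependency> rules) {
    this.finalizedMlv = finalizedMlv;
    this.rules = rules;
  }

  /** True only if every dependency of (component, targetMlv) is satisfied. */
  public boolean canFinalize(String component, int targetMlv) {
    for (Dependency d : rules) {
      if (d.component().equals(component) && d.mlv() <= targetMlv
          && finalizedMlv.getOrDefault(d.requires(), -1) < d.requiredMlv()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    CrossComponentMlvCheck check = new CrossComponentMlvCheck(
        Map.of("SCM", 1, "DN", 1),
        List.of(new Dependency("DN", 2, "SCM", 2)));
    System.out.println(check.canFinalize("DN", 2)); // false: SCM still at MLV 1
    System.out.println(check.canFinalize("DN", 1)); // true
  }
}
```

With such a rule table, finalizing a DN to MLV 2 would be refused until SCM reaches MLV 2, which is the ordering concern raised above.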

Thanks,
Yiqun

On 2020/8/26, 1:05 AM, "Aravindan Vijayan" <av...@cloudera.com.INVALID> wrote:


    Hi Marton,

    Thanks for the questions. Answers below.

    *On-line upgrade vs offline-upgrade*
    The "Pre-Finalized" state is not meant to be a Read only state in Ozone.
    All existing Read/Write APIs will be allowed since they are guaranteed to
    be backward compatible. The only APIs that will not be allowed before
    finalization are those that are new or those that caused a layout change.
    For example, create EC file, Truncate etc. Hence, this is not really an
    "online" upgrade.


    *Enable independent upgrades of datanodes, which can make it much easier
    to upgrade a big cluster.*
    From the examples you have mentioned, I do see some advantages to
    supporting separate datanode upgrades. The logic we went with now is meant
    to be restrictive since it is a full non-rolling upgrade (master +
    workers). However, keeping rolling upgrades in mind, we will implement it
    in such a way that it can easily support the use case you mention in the
    future. Instead of keeping 1 HDDS version, we can fork off the Datanode
    layout version separately, and maintain a code level compatibility matrix
    between SCM and Datanodes in the future. That way, SCM can support
    Datanodes of multiple layout versions together, with the only restriction
    that an active pipeline (Ratis/EC) can be created only with those of the
    same layout version.


    *Finalize*
    As mentioned earlier, the Pre-Finalized state is not a complete standstill
    state for Ozone. Only new features/APIs/layout changes will be disallowed
    until the user decides to Finalize. This state will serve as an "insurance"
    for the user (and the Ozone team) to allow downgrade to an older version if
    basic compat is broken or there is a serious regression. The name
    "finalize" has been borrowed from the HDFS world. IMHO, it is a more
    intuitive user experience to have this CLI-driven (in the case of a CM
    managed cluster, it will be a clickable UI option) rather than having the
    user restart the cluster with a specific config change (which is an Ozone
    internal detail) for the layout update.

    *During your presentation you talked about the downgrade/rollback. I felt
    that there could be a lot of tricky corner cases related to ratis +
    snapshot. *
    *As a concept I like it (but my 2nd point is more important for me, if
    possible), but I think we will see tricky technical problems on the code
    level.*
    Yes, with respect to Ratis, it will be a challenge to guarantee that the
    same "version" of the code "applies the transaction" on all 3 nodes
    during the upgrade. We can approach the problem by doing the following:
    > Handling Ratis request handling changes as layout changes
    > Tagging every Ratis request with the current layout version
    > Introducing a "factory" in the Ratis request handler which looks at the
    version of the request from the log, and then supplies the correct
    implementation for that request.
    In the future, there is also a plan to move the handling of Ratis request
    versioning to a separate version hierarchy than MLV/SLV. I will be adding
    more details on the v2.0 doc that will be uploaded later this week to
    HDDS-3698.


    -- 
    Thanks & Regards,
    Aravindan


---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-dev-help@hadoop.apache.org


Re: Ozone non-rolling upgrades

Posted by Aravindan Vijayan <av...@cloudera.com.INVALID>.
Hi Marton,

Thanks for the questions. Answers below.

*On-line upgrade vs offline-upgrade*
The "Pre-Finalized" state is not meant to be a Read only state in Ozone.
All existing Read/Write APIs will be allowed since they are guaranteed to
be backward compatible. The only APIs that will not be allowed before
finalization are those that are new or those that caused a layout change.
For example, create EC file, Truncate etc. Hence, this is not really an
"online" upgrade.


*Enable independent upgrades of datanodes, which can make it much easier
to upgrade a big cluster.*
From the examples you have mentioned, I do see some advantages to
supporting separate datanode upgrades. The logic we went with now is meant
to be restrictive since it is a full non-rolling upgrade (master +
workers). However, keeping rolling upgrades in mind, we will implement it
in such a way that it can easily support the use case you mention in the
future. Instead of keeping 1 HDDS version, we can fork off the Datanode
layout version separately, and maintain a code level compatibility matrix
between SCM and Datanodes in the future. That way, SCM can support
Datanodes of multiple layout versions together, with the only restriction
that an active pipeline (Ratis/EC) can be created only with those of the
same layout version.
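The restriction described above could be sketched roughly as follows (illustrative names only, not actual Ozone classes): SCM groups datanodes by their reported layout version and forms a pipeline only from nodes in the same group.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Illustrative sketch (not actual Ozone code): SCM tracks datanodes at
// multiple layout versions, but a new pipeline is only formed from nodes
// that all share one layout version.
public class PipelinePlacementSketch {

  record DatanodeInfo(String id, int layoutVersion) {}

  /** Pick 'size' datanodes that all share the same layout version, if possible. */
  static Optional<List<DatanodeInfo>> selectPipeline(List<DatanodeInfo> nodes, int size) {
    Map<Integer, List<DatanodeInfo>> byVersion =
        nodes.stream().collect(Collectors.groupingBy(DatanodeInfo::layoutVersion));
    return byVersion.values().stream()
        .filter(group -> group.size() >= size)
        .findFirst()
        .map(group -> List.copyOf(group.subList(0, size)));
  }

  public static void main(String[] args) {
    List<DatanodeInfo> nodes = List.of(
        new DatanodeInfo("dn1", 2), new DatanodeInfo("dn2", 2),
        new DatanodeInfo("dn3", 2), new DatanodeInfo("dn4", 1),
        new DatanodeInfo("dn5", 1));
    // A 3-node (Ratis) pipeline fits: the three version-2 nodes qualify.
    System.out.println(selectPipeline(nodes, 3).isPresent());
    // No single version group has 4 nodes, so a 4-node pipeline fails here.
    System.out.println(selectPipeline(nodes, 4).isPresent());
  }
}
```

This keeps mixed-version clusters usable during a long rolling datanode upgrade while never mixing layout versions within one pipeline.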


*Finalize*
As mentioned earlier, the Pre-Finalized state is not a complete standstill
state for Ozone. Only new features/APIs/layout changes will be disallowed
until the user decides to Finalize. This state will serve as an "insurance"
for the user (and the Ozone team) to allow downgrade to an older version if
basic compat is broken or there is a serious regression. The name
"finalize" has been borrowed from the HDFS world. IMHO, it is a more
intuitive user experience to have this CLI-driven (in the case of a CM
managed cluster, it will be a clickable UI option) rather than having the
user restart the cluster with a specific config change (which is an Ozone
internal detail) for the layout update.
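The pre-finalized gating can be sketched roughly like this (a simplified illustration with made-up names, not the actual Ozone implementation): every feature records the layout version that introduced it, and a request for a feature newer than the current MLV is rejected until the cluster is finalized.

```java
import java.util.Map;

// Simplified illustration of pre-finalized gating. Old APIs keep working;
// only features newer than the current MLV are refused before finalize.
public class LayoutFeatureGate {

  private int metadataLayoutVersion;        // current MLV
  private final int softwareLayoutVersion;  // SLV of the installed binaries
  private final Map<String, Integer> featureIntroducedIn;

  public LayoutFeatureGate(int mlv, int slv, Map<String, Integer> features) {
    this.metadataLayoutVersion = mlv;
    this.softwareLayoutVersion = slv;
    this.featureIntroducedIn = features;
  }

  public boolean isAllowed(String feature) {
    // Features from older layout versions keep working; new ones are gated.
    return featureIntroducedIn.getOrDefault(feature, 0) <= metadataLayoutVersion;
  }

  /** The explicit "finalize" step: bump the MLV up to the software's SLV. */
  public void finalizeUpgrade() {
    metadataLayoutVersion = softwareLayoutVersion;
  }

  public static void main(String[] args) {
    LayoutFeatureGate gate = new LayoutFeatureGate(1, 2,
        Map.of("CREATE_KEY", 1, "CREATE_EC_KEY", 2));
    System.out.println(gate.isAllowed("CREATE_KEY"));     // true: old API works
    System.out.println(gate.isAllowed("CREATE_EC_KEY"));  // false: gated
    gate.finalizeUpgrade();
    System.out.println(gate.isAllowed("CREATE_EC_KEY"));  // true after finalize
  }
}
```

Until finalizeUpgrade() runs, the on-disk layout is untouched by new features, which is what keeps a downgrade to the older binaries safe.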

*During your presentation you talked about the downgrade/rollback. I felt
that there could be a lot of tricky corner cases related to ratis +
snapshot. *
*As a concept I like it (but my 2nd point is more important for me, if
possible), but I think we will see tricky technical problems on the code
level.*
Yes, with respect to Ratis, it will be a challenge to guarantee that the
same "version" of the code "applies the transaction" on all 3 nodes
during the upgrade. We can approach the problem by doing the following:
> Handling Ratis request handling changes as layout changes
> Tagging every Ratis request with the current layout version
> Introducing a "factory" in the Ratis request handler which looks at the
version of the request from the log, and then supplies the correct
implementation for that request.
In the future, there is also a plan to move the handling of Ratis request
versioning to a separate version hierarchy than MLV/SLV. I will be adding
more details on the v2.0 doc that will be uploaded later this week to
HDDS-3698.
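The "factory" idea above can be sketched like this (hypothetical names; in practice this would live in the Ratis state machine): each log entry carries the layout version it was written under, and replay resolves the handler registered for the highest version not exceeding it.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a version-aware handler factory: each request is tagged with the
// layout version it was written under, and replay picks the implementation
// registered for the highest version that does not exceed that tag.
public class VersionedHandlerFactory {

  interface RequestHandler {
    String apply(String request);
  }

  private final NavigableMap<Integer, RequestHandler> handlers = new TreeMap<>();

  void register(int layoutVersion, RequestHandler handler) {
    handlers.put(layoutVersion, handler);
  }

  /** Resolve the handler for a request tagged with the given layout version. */
  RequestHandler forVersion(int requestLayoutVersion) {
    var entry = handlers.floorEntry(requestLayoutVersion);
    if (entry == null) {
      throw new IllegalStateException("No handler for version " + requestLayoutVersion);
    }
    return entry.getValue();
  }

  public static void main(String[] args) {
    VersionedHandlerFactory factory = new VersionedHandlerFactory();
    factory.register(1, req -> "v1:" + req);
    factory.register(2, req -> "v2:" + req);
    // A log entry written before finalization replays with the v1 handler,
    // even though the v2 code is installed.
    System.out.println(factory.forVersion(1).apply("createKey"));
    System.out.println(factory.forVersion(2).apply("createKey"));
  }
}
```

This is what lets old Ratis log entries be re-applied byte-for-byte identically on all replicas while a new binary is running.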

On Tue, Aug 25, 2020 at 5:22 AM Elek, Marton <el...@apache.org> wrote:

>
> Bumping this thread.
>
> If you have any opinion, please let me know.
>
> Thanks a lot,
> Marton
>
>
>
>
> On 6/26/20 2:51 PM, Elek, Marton wrote:
> >
> > Thank you very much for working on this, Aravindan.
> >
> > Finally, I collected my thoughts about the proposal.
> >
> > First of all, I really like the concept in general, and I like the style
> > of the documentation. It clearly explains a lot of existing behavior of
> > Ozone to make it easier to understand the problems.
> >
> > I like the abstraction of Software Layout Version vs. Metadata
> > Layout Version.
> >
> > I have some comments, but most of them are about technical details (not
> > about the concept itself). They are questions and ideas, not strong
> > opinions.
> >
> > 1. On-line upgrade vs offline-upgrade
> >
> > There is an option to do the upgrade offline: instead of calling an RPC,
> > executing a CLI.
> >
> > a) for an online upgrade we need to introduce a very specific running mode
> > which means that nobody can use the cluster (or only in read-only mode?)
> > until the server is "finalized"
> >
> > b) a CLI can do any migration and upgrade the MLV inside the database. The
> > only question is the old / persisted data in the raft log, but IMHO it
> > shouldn't be a problem:
> >
> >   1. we should commit the MLV upgrade with a raft transaction anyway
> >   2. ratis log entries are like client calls, and we are supposed to be
> > backward compatible with old clients
> >
> > I am not sure if the CLI approach is better (it seems simpler to me),
> > but at least we can compare the two approaches and explain why we
> > prefer the RPC-based method (if that is the better one)
> >
> > 2. I had an interesting conversation about why HDFS clusters are not
> > upgraded to Hadoop 3 and got some thoughts.
> >
> > This document proposes to always use the same version for SCM and
> > datanode, which makes it simple.
> >
> > I agree that it simplifies our job, but I think it can make the upgrade
> > harder, especially for a 1-2000 node cluster.
> >
> > After the storage-class proposal I have a different mental model:
> >
> >   I think there can be different types of containers with different
> > replication strategies. Containers are classified with a storage-class, and
> > the storage-class defines the container replication type.
> >
> > In this model it's very easy to imagine that different datanodes can
> > support different replication types (or replication versions).
> >
> > Let's say I have 1000 nodes and I upgrade 500 of them to a specific
> > datanode version which can support EC containers. SCM can easily manage
> > this if it's already prepared to support different types of
> > containers / replications (which is our goal, IMHO) based on node
> > capabilities.
> >
> > In this model it should be easy to enable independent upgrades of
> > datanodes, which would make it much easier to upgrade a big cluster.
> > (But I agree with requiring OM/SCM/RECON upgrade at the same time.)
> >
> >
> > What do you think about this?
> >
> >
> > 3. Finalize
> >
> > Personally I don't like the word "finalize". It suggests that we have an
> > upgrade process which can be "finalized", but in fact we don't have such a
> > process. We only start doing any work AFTER the finalize button is pushed.
> >
> > I know that it comes from the HDFS history, but I would prefer to use a
> > more generic and expressive term (for example: jar/binary upgrade vs.
> > metadata upgrade).
> >
> > In the end I learned what "finalize" means (thanks to your patient
> > explanation during an offline conversation ;-) ), but we can make the
> > understanding easier for future users.
> >
> > 4. During your presentation you talked about downgrade/rollback. I
> > felt that there could be a lot of tricky corner cases related to ratis +
> > snapshot. As a concept I like it (but my 2nd point is more important for
> > me, if possible), but I think we will see tricky technical problems at
> > the code level.
> >
> >
> > Thanks again for the great work,
> > Marton
> >
> >
>
>
>

-- 
Thanks & Regards,
Aravindan