Posted to dev@ozone.apache.org by Stephen O'Donnell <so...@cloudera.com.INVALID> on 2020/08/03 11:35:14 UTC

Topology: Non Rack Aware Pipelines

Hi All,

I would like to revisit network topology around pipeline creation and
destruction.

Pipelines are created by the RatisPipelineProvider which delegates
responsibility for picking the pipeline nodes to the
PipelinePlacementPolicy.

When picking the nodes for a pipeline, the PipelinePlacementPolicy will
check whether a topology is configured and more than 1 rack is present, and
if so, try to create pipelines spanning multiple racks. Otherwise it will
select random nodes - this is the fallback mechanism, intended for clusters
with a single rack or no topology configured.
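
To make that branch concrete, here is a minimal, self-contained sketch of
the decision as I understand it. It is not the actual
PipelinePlacementPolicy code - nodes and racks are plain strings and the
helper logic is simplified purely for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class PlacementSketch {

  // rackOf maps a node id to its rack, e.g. "dn1" -> "/rack1".
  static List<String> choosePipelineNodes(Map<String, String> rackOf,
      List<String> healthyNodes, int nodesRequired) {
    List<String> candidates = new ArrayList<>(healthyNodes);
    Collections.shuffle(candidates);
    long racks = candidates.stream().map(rackOf::get).distinct().count();
    if (racks <= 1) {
      // Fallback path: random selection, intended for single-rack or
      // no-topology clusters (but reachable in the problem cases below).
      return candidates.subList(0, Math.min(nodesRequired, candidates.size()));
    }
    // Preferred path: pick an anchor node, ensure at least one other node
    // comes from a different rack, then fill the rest from anywhere.
    List<String> chosen = new ArrayList<>();
    chosen.add(candidates.get(0));
    String anchorRack = rackOf.get(candidates.get(0));
    for (String node : candidates) {
      if (chosen.size() < nodesRequired && !chosen.contains(node)
          && !rackOf.get(node).equals(anchorRack)) {
        chosen.add(node);
      }
    }
    for (String node : candidates) {
      if (chosen.size() < nodesRequired && !chosen.contains(node)) {
        chosen.add(node);
      }
    }
    return chosen;
  }
}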

As I have raised before, we have a few problems:

1) On cluster startup, pipeline creation is triggered immediately when
nodes register. If at least 3 nodes from 1 rack register before any others,
they can be part of a pipeline which is not rack aware. We have somewhat
fixed this with safemode rules.

2) If the nodes per rack are not perfectly balanced, it would be possible
for 3 DNs in 1 rack to have capacity for more pipelines, with all other
nodes having no capacity. If that happens, the fallback mechanism would be
used, and a non-rack aware pipeline would be created.

3) If something happens such that only 1 rack is available for some time
(restart or rack outage) the cluster will create new pipelines on 1 rack,
and these will never get destroyed, even when the missing rack returns to
service.

Proposal 1:

If the cluster has multiple racks AND there are healthy nodes covering at
least 2 racks, where healthy is defined as a node which is registered and
not stale or dead, then we should not allow "fallback" pipelines (pipelines
which span only 1 rack) to be created.

This means if you have a badly configured cluster - e.g. Rack 1 = 10 nodes;
Rack 2 = 1 node - the pipeline limit will be constrained by the capacity of
that 1 node on rack 2. Even a setup like Rack 1 = 10 nodes, Rack 2 = 5 nodes
would be constrained by this.

IMO, this constraint is better than creating non rack aware pipelines, as
the cluster setup should be fixed.

This should also handle the case when the cluster degrades to 1 rack, as
the healthy node definition will notice only 1 rack is alive.

It would be quite easy to implement this in
PipelinePlacementPolicy#filterViableNodes, as we already get the list of
healthy nodes, and then exclude overloaded nodes.
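
As a rough sketch of the extra check (the method and parameter names below
are only illustrative, not the real Ozone APIs), filterViableNodes could
refuse the fallback whenever the registered nodes span multiple racks and
the healthy nodes still cover at least 2 of them, and let pipeline creation
fail instead:

import java.util.Collection;

class FallbackCheckSketch {

  // rackOfRegistered: the rack of every registered node (the cluster view).
  // rackOfHealthy:    the rack of each healthy (not stale or dead) node.
  static boolean fallbackAllowed(Collection<String> rackOfRegistered,
      Collection<String> rackOfHealthy) {
    long clusterRacks = rackOfRegistered.stream().distinct().count();
    long healthyRacks = rackOfHealthy.stream().distinct().count();
    // Multiple racks configured AND healthy nodes still cover at least 2 of
    // them: refuse the single-rack fallback and let pipeline creation fail,
    // rather than quietly creating a non rack aware pipeline.
    return !(clusterRacks > 1 && healthyRacks >= 2);
  }
}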

Questions:

1. Pipeline creation does not consider capacity - do we need to consider
capacity in the "healthy node" definition? Eg, extend it to nodes which are
not stale or dead, and have X bytes of available space? What if no nodes
have enough space?

2. What happens to a pipeline if one node in the pipeline runs out of
space? Will this be detected and the pipeline destroyed?


Proposal 2:

In the PipelineManager, there is already a thread called the
BackgroundPipelineCreator. As well as creating pipelines, I think it should
check existing pipelines using a rule similar to proposal 1. I.e., if the
cluster has multiple racks, and there are healthy nodes spanning more than
1 rack, it should destroy non-rack-aware pipelines.

This would handle the case where the cluster degrades to a single rack,
single rack pipelines get created, and then it returns to multi-rack. It
would also allow for any non rack aware pipelines created at startup to be
cleaned up.
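
A sketch of the scrubbing rule (again with placeholder names rather than
the real PipelineManager APIs): collect the open pipelines whose members
all sit on one rack while healthy nodes still span two or more racks, and
hand them to the existing destroy path:

import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PipelineScrubSketch {

  // pipelineRacks maps a pipeline id to the racks of its member datanodes,
  // e.g. "p1" -> ["/rack1", "/rack1", "/rack1"].
  static List<String> nonRackAwarePipelines(
      Map<String, List<String>> pipelineRacks,
      Collection<String> healthyNodeRacks) {
    long healthyRacks = healthyNodeRacks.stream().distinct().count();
    if (healthyRacks < 2) {
      // Single rack (or degraded) cluster: single-rack pipelines are the
      // best we can do, so leave them alone.
      return Collections.emptyList();
    }
    return pipelineRacks.entrySet().stream()
        .filter(e -> e.getValue().stream().distinct().count() == 1)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}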

Questions:

1. Should the pipeline destruction be throttled? Consider the case where
the cluster goes from 2 racks to 1 rack. All nodes on the remaining rack
will be involved in non-rack-aware pipelines up to their pipeline limit.
When the second rack comes back online, it will not be able to create any
pipelines, until we free capacity on the existing nodes.

2. Assuming the destruction is throttled, I would welcome some ideas about
metrics that can be used to throttle it, that will handle small to large
clusters. Perhaps something as simple as "destroy at least 1 and at most X%
of bad pipelines, then run createPipelines, sleep, repeat" - see the rough
sketch below.
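
For illustration, a very rough sketch of that throttled loop body - the 10%
figure, the interface and the method names are all placeholders, not
existing APIs:

import java.util.List;

class ThrottledScrubSketch {

  static final double MAX_FRACTION_PER_ROUND = 0.10; // the "X%" placeholder

  // Stand-in for whatever the background thread actually exposes.
  interface PipelineOps {
    void closePipeline(String pipelineId);
    void createPipelines();
  }

  static void scrubOnce(List<String> badPipelines, PipelineOps ops) {
    if (badPipelines.isEmpty()) {
      return;
    }
    // Destroy at least 1, and at most X%, of the bad pipelines...
    int toClose = Math.max(1,
        (int) Math.floor(badPipelines.size() * MAX_FRACTION_PER_ROUND));
    for (String id : badPipelines.subList(0,
        Math.min(toClose, badPipelines.size()))) {
      ops.closePipeline(id);
    }
    // ...then let pipeline creation run; the surrounding loop sleeps and
    // repeats until no bad pipelines remain.
    ops.createPipelines();
  }
}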

Note that in a very small cluster - 6 nodes, 3 nodes per rack - if 1 rack
is down and its 3 nodes are in a pipeline, we cannot create a new pipeline
without briefly going to zero pipelines on the cluster.

I would like to get some agreement on the proposals before making any code
changes. Please let me know if there are any things I have missed or other
potential problems this could introduce.

Thanks,

Stephen.

Re: Topology: Non Rack Aware Pipelines

Posted by Stephen O'Donnell <so...@cloudera.com.INVALID>.
For proposal 1, I have created HDDS-4062 and a PR at
https://github.com/apache/hadoop-ozone/pull/1291

We had some discussion on the community call about closing pipelines
regularly after some time, and non-rack-aware pipelines should be handled
as part of that. It is hoped we can extend the existing Pipeline Scrubber
(part of the BackgroundPipelineCreator thread) to handle this. I will share
a design doc on HDDS-4065 in the near future, and we can move any
discussion to the doc / Jira rather than over email.

Thanks,

Stephen.


Re: Topology: Non Rack Aware Pipelines

Posted by Stephen O'Donnell <so...@cloudera.com.INVALID>.
Hi Sammi,

Thanks for the detailed reply.

After HDDS-3270 we can prevent pipelines from getting created in the
cluster until hdds.scm.safemode.min.datanode nodes have registered, so that
should largely avoid "bad" pipelines getting created on startup.

From a topology perspective, I am not too concerned about non-rack-aware
pipelines being created at startup time, so long as we can close those
pipelines later. If non-rack-aware pipelines get created and remain
forever, then it gives Replication Manager more work to do, as it must
replicate one copy of every closed container which is not rack aware.

It seems to be an open question whether regularly closing pipelines
is a good idea. I have captured the details gathered from several people in
a short document and attached it to
https://issues.apache.org/jira/browse/HDDS-4065 so the information is all
in one place.

A less controversial question is whether we should close non-rack-aware
pipelines on a rack-aware cluster. One suggestion I heard was to highlight
the problem in Recon and allow an admin to close the non-rack-aware
pipelines manually at a convenient time. This could work, but I feel it
would be better to handle this automatically. The longer it goes unhandled,
the more containers will need to be replicated to ensure data availability.

I agree that the fallback is important, but I believe it is a bug for the
fallback to be used when there are multiple racks alive. A small two-rack
cluster is one corner case where this can happen. However, even on a large,
healthy cluster with hundreds of pipelines, it is still possible for a
non-rack-aware pipeline to get created in some scenarios, and we should
avoid that - https://issues.apache.org/jira/browse/HDDS-4062 is intended to
fix it.

Thanks,

Stephen.


Re: Topology: Non Rack Aware Pipelines

Posted by Sammi Chen <sa...@apache.org>.
Hi Stephen,

Thanks for initiating this.  Here are some of my thoughts based on in-field
experience and discussion during today's community call.

The goals we want to achieve are,
a)  make sure containers can tolerate rack-level failure, so we need
cross-rack pipelines
b)  make sure containers are evenly distributed and the container overlap
between any three datanodes is minimized, so that the amount of data made
unavailable when three datanodes go down at the same time is minimized.

Given the current state:
1.   pipeline creation and healthy pipeline destruction are expensive
operations, so it's better to avoid unnecessary operations as much as
possible.
2.   massive container replication is an expensive operation too.

Let's have a case by case analysis,
1.  For a fresh new cluster, SCM will exit safe mode very soon after
startup.  Most pipelines are created with the same group members and the
same datanode as the Ratis leader. As long as there are 3 healthy nodes
available, the background thread starts to create pipelines (in this case,
even HDDS-4062 cannot alleviate it).  A CLI command, just like the current
ReplicationManager, to turn on/off background pipeline creation seems a
good choice.  For a fresh new cluster, the admin turns on pipeline auto
creation after enough datanodes are registered. Though this adds some
burden for the cluster admin, it's kind of acceptable.

2.  For a stably running cluster without datanode add/remove, periodically
closing pipelines is basically not preferred, because after a pipeline is
closed you will get a new pipeline with the same datanodes as group
members. But to achieve goal b), we have to periodically close pipelines
and recreate new ones.  The only thing I worry about is that we will have
more partially full containers. We need "HDDS-3952. Container Compact".
The pipeline close frequency is very important.

3.  When expanding a cluster by adding a batch of new datanodes, with the
pipeline close mechanism the key point is the frequency.  If we want to
distribute write requests quickly to the new datanodes, a short interval is
preferred. But again, a short interval means more partially full
containers.

4.  When a cluster degrades because of a network partition, rack power
failure or old servers being recycled, pipelines on the involved datanodes
will be closed after a timeout.  But whether new pipelines can be created
is a question.  So it's a trade-off: fallback vs no fallback, write
throughput & availability vs data safety.  Basically, I think all these
cases which cause cluster degradation will usually be fixed in a short
time. So fallback is more user friendly and the data safety risk is
relatively low.

Bests,
Sammi

