Posted to dev@trafficcontrol.apache.org by Nir Sopher <ni...@qwilt.com> on 2017/01/31 13:10:05 UTC

Streamlining TC management and operations sequences

Hi,

In order to further improve the simplicity and robustness of the control
path for provisioning infrastructure and delivery services, we are
currently considering ways to streamline management and operations.

Currently, when applying changes in traffic-control that require
synchronization between the traffic-router and the traffic-servers, the user
must take care to apply them in a certain order. Otherwise, "black holes"
may be created. Furthermore, in some scenarios the user has to wait
and verify that the configuration has reached all traffic servers before
applying it to the traffic-router.

We have noticed that TC-3.0 is planned to include a "Config State Machine",
which will probably deal with the issue thoroughly. We have no further
information about this item and would appreciate any additional info.

We would like to start investing in making TC operations more streamlined,
robust and user-friendly.

The main use-cases we would like to address at this point are:

   1. Assign servers to a Delivery-Service.
   For this operation, the configuration must first be applied to the added
   traffic servers, propagate, and only then be applied to the traffic-router.
   2. Remove server assignments from a Delivery-Service.
   For this operation, the configuration must first be applied to the
   traffic-router, and only then to the traffic-servers.
   3. Add a new delivery service.
   This is practically a special case of assigning servers to a
   delivery-service.
   4. Delete a delivery service.
   This is practically a special case of removing server assignments from a
   delivery-service.
   5. Update settings that must be applied together on the traffic servers
   and the router.

We would like to simplify the procedure, allowing the deployment of new
configuration in a single operation, instead of doing it step by step.

One solution can be based on the insight that such configuration changes
may be deployed by first updating the traffic servers with the added
functionality (e.g. a remap rule), then updating the router, and lastly
removing the old functionality from the traffic servers. Such a solution can
be orchestrated by traffic-ops, probably without complicating other components.
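
To make the ordering concrete, here is a minimal Python sketch of the
add / update router / remove sequence described above. It is only an
illustration under assumptions: the Plan class, the deploy and
has_black_hole functions, and the dictionary model of servers and router are
all hypothetical and are not taken from traffic-ops. The readiness assertion
stands in for the "wait and verify" step.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical description of one delivery-service change."""
    add_rules: dict[str, set[str]] = field(default_factory=dict)     # server -> rules to add
    remove_rules: dict[str, set[str]] = field(default_factory=dict)  # server -> rules to remove


def deploy(servers: dict[str, set[str]], router: dict[str, set[str]], plan: Plan) -> list[str]:
    """Apply `plan` in three phases and return a log of the phases executed.

    `servers` maps a server name to the remap rules it currently holds.
    `router` maps a server name to the rules the router may send to it.
    """
    log = []

    # Phase 1: add new functionality to the traffic servers first.
    for server, rules in plan.add_rules.items():
        servers.setdefault(server, set()).update(rules)
    log.append("servers: added")

    # Phase 2: update the router only once every server holds its new rules.
    # The assertion is a stand-in for waiting and verifying propagation.
    for server, rules in plan.add_rules.items():
        assert rules <= servers[server], "server not ready"
        router.setdefault(server, set()).update(rules)
    for server, rules in plan.remove_rules.items():
        router.get(server, set()).difference_update(rules)
    log.append("router: updated")

    # Phase 3: remove old functionality from the servers last.
    for server, rules in plan.remove_rules.items():
        servers.get(server, set()).difference_update(rules)
    log.append("servers: removed")
    return log


def has_black_hole(servers: dict[str, set[str]], router: dict[str, set[str]]) -> bool:
    """True if the router could send a server traffic for a rule it lacks."""
    return any(not rules <= servers.get(s, set()) for s, rules in router.items())
```

In this model the router never references a rule that a server does not
hold, at any point between the phases, which is the property the ordering is
meant to guarantee.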

Other solutions may provide more flexibility, but would probably involve
adding complexity to other components such as traffic-router.

We would be glad to hear the community's thoughts on the matter, so we can
take this further.

Thanks,
Nir

Re: Streamlining TC management and operations sequences

Posted by Nir Sopher <ni...@qwilt.com>.
Hi Eric,
Actually, as we imagined it, a "generation" is created only when a new
configuration is applied, that is, when the "consistent hash" is permanently
modified.

I'll open a separate thread to discuss the technical details further,
including an algorithm we have in mind.

I also opened TC-130 - Streamlining TC management and operations sequences
<https://issues.apache.org/jira/browse/TC-130> to further monitor the issue.

I would appreciate community input on the issue, especially on the pros and
cons of the two approaches: a Traffic Ops orchestrated solution vs. a more
flexible solution based on a traffic-router algorithm.

Nir




On Wed, Feb 1, 2017 at 3:33 PM, Eric Friedrich (efriedri) <
efriedri@cisco.com> wrote:

> Hey Nir-
>   Interesting thought for sure.
>
> Would TM “health changes” (loss of connectivity, BW/loadavg too high)
> change the generation count? It seems like the answer is Yes, because the
> health of a cache impacts the state of the consistent hash ring.
>
> If so, how do these generation changes get from the Traffic Monitor to the
> caches, when config changes typically come only from Traffic Ops and only
> when ORT is run?
>
> Or maybe the generation count is just an abstraction to conceptualize the
> problem space and not a literal approach?
>
> —Eric

Re: Streamlining TC management and operations sequences

Posted by "Eric Friedrich (efriedri)" <ef...@cisco.com>.
Hey Nir-
  Interesting thought for sure. 

Would TM “health changes” (loss of connectivity, BW/loadavg too high) change the generation count? It seems like the answer is Yes, because the health of a cache impacts the state of the consistent hash ring. 

If so, how do these generation changes get from the Traffic Monitor to the caches, when config changes typically come only from Traffic Ops and only when ORT is run? 

Or maybe the generation count is just an abstraction to conceptualize the problem space and not a literal approach? 

—Eric
  
> On Feb 1, 2017, at 4:14 AM, Nir Sopher <ni...@qwilt.com> wrote:
> 
> Hi Eric,
> 
> Formalizing the approach you suggested, one may introduce the concept of a
> delivery-service configuration "generation", which would be an ordinal
> identifier for a delivery-service configuration. A "generation" changes
> whenever the remap rule changes or the consistent hash mapping of content
> to servers changes (e.g. due to an additional server assignment).
> In such a solution, each traffic-server may hold a single generation for
> each delivery-service configuration, while the traffic-router may hold a
> history of generations and know which server holds which configuration
> generation.
> 
> This approach introduces considerable flexibility. It allows
> configurations to be set one after the other with no need to wait between
> them.
> It also fits well with Jeremy's suggestion for queue-update at
> delivery-service granularity.
> 
> On the other hand, complicated algorithms for solving the issue may pose
> more risk to the network when applied, compared to a simple "traffic-ops"
> orchestrated solution.
> 
> I'm not sure what is preferable from an operator's point of view. I'm also
> not familiar with the TC 3.0 configuration solution, so I cannot validate
> the different approaches against it.
> 
> Please share your thoughts,
> Thanks,
> Nir


Re: Streamlining TC management and operations sequences

Posted by Nir Sopher <ni...@qwilt.com>.
Hi Eric,

Formalizing the approach you suggested, one may introduce the concept of a
delivery-service configuration "generation", which would be an ordinal
identifier for a delivery-service configuration. A "generation" changes
whenever the remap rule changes or the consistent hash mapping of content
to servers changes (e.g. due to an additional server assignment).
In such a solution, each traffic-server may hold a single generation for
each delivery-service configuration, while the traffic-router may hold a
history of generations and know which server holds which configuration
generation.
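
As a rough illustration of the bookkeeping this implies on the router side,
here is a minimal Python sketch. The GenerationTracker class and its method
names are hypothetical, and the sketch only records which server holds which
generation; it deliberately does not propose a routing algorithm.

```python
from collections import defaultdict


class GenerationTracker:
    """Hypothetical router-side view of delivery-service config generations."""

    def __init__(self):
        # ds -> list of configs; the list index is the generation number.
        self.history = defaultdict(list)
        # (server, ds) -> the single generation that server currently holds.
        self.held = {}

    def publish(self, ds, config):
        """Record a new configuration and return its generation number."""
        self.history[ds].append(config)
        return len(self.history[ds]) - 1

    def report(self, server, ds, generation):
        """Note that `server` has applied `generation` of `ds`."""
        if generation >= len(self.history[ds]):
            raise ValueError("unknown generation")
        self.held[(server, ds)] = generation

    def servers_at(self, ds, generation):
        """Sorted names of servers currently holding `generation` of `ds`."""
        return sorted(
            s for (s, d), g in self.held.items() if d == ds and g == generation
        )
```

For example, after publishing two generations of a delivery service, the
router can ask servers_at(ds, 1) to see which servers have already moved to
the newer configuration and which are still on the older one.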

This approach introduces considerable flexibility. It allows
configurations to be set one after the other with no need to wait between
them.
It also fits well with Jeremy's suggestion for queue-update at
delivery-service granularity.

On the other hand, complicated algorithms for solving the issue may pose
more risk to the network when applied, compared to a simple "traffic-ops"
orchestrated solution.

I'm not sure what is preferable from an operator's point of view. I'm also
not familiar with the TC 3.0 configuration solution, so I cannot validate
the different approaches against it.

Please share your thoughts,
Thanks,
Nir

On Tue, Jan 31, 2017 at 6:26 PM, Eric Friedrich (efriedri) <
efriedri@cisco.com> wrote:

> What about an approach (apologies, still light on details), where TR
> (perhaps still via TM) discovers the availability of delivery services from
> the cache itself, rather than from the CRConfig file? (Astats or its
> remap_stats based replacement would publish its remap rules)
>
> Any changes to the set of servers (add/remove) or DS assignments would not
> require a specific step to push a changed config to the router. If a cache
> does not yet, or no longer has remap rules for a specific delivery service,
> then TR will not see that rule advertised by the cache and will not send it
> traffic. If adding or removing a server, TM still needs to be updated to
> learn about the new server.
>
> With current configuration, there is a race condition of a few seconds where
> a cache removes a remap rule before TM polls and TR gets health info from TM.
> In these few seconds, TR would erroneously send traffic to a cache without
> a proper remap rule.
>
> We could fix this by
>   a) advertising a state of the remap rule in astats to notify TR to stop
> sending traffic on that DS for a short period before the rule is
> actually removed (all handled inside of ORT).
>     or
>   b) prematurely removing the remap rule from astats, before the config on
> TS is actually updated (at the cost of missing the final few remap stats
> numbers). This is probably unacceptable.
>
> I’m sure there are other variants on this, but my main goal is for TR to
> directly learn from the caches which delivery services they actually have
> available, rather than learning what TO only thinks each cache has
> available.
>
> —Eric

Re: Streamlining TC management and operations sequences

Posted by "Eric Friedrich (efriedri)" <ef...@cisco.com>.
What about an approach (apologies, still light on details), where TR (perhaps still via TM) discovers the availability of delivery services from the cache itself, rather than from the CRConfig file? (Astats or its remap_stats based replacement would publish its remap rules) 

Any changes to the set of servers (add/remove) or DS assignments would not require a specific step to push a changed config to the router. If a cache does not yet, or no longer has remap rules for a specific delivery service, then TR will not see that rule advertised by the cache and will not send it traffic. If adding or removing a server, TM still needs to be updated to learn about the new server. 

With current configuration, there is a race condition of a few seconds where a cache removes a remap rule before TM polls and TR gets health info from TM. In these few seconds, TR would erroneously send traffic to a cache without a proper remap rule.

We could fix this by
  a) advertising a state of the remap rule in astats to notify TR to stop sending traffic on that DS for a short period before the rule is actually removed (all handled inside of ORT).
    or
  b) prematurely removing the remap rule from astats, before the config on TS is actually updated (at the cost of missing the final few remap stats numbers). This is probably unacceptable.

I’m sure there are other variants on this, but my main goal is for TR to directly learn from the caches which delivery services they actually have available, rather than learning what TO only thinks each cache has available.
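
To illustrate option (a), here is a minimal Python sketch of how TR might pick caches from what they advertise. Everything in it is an assumption for illustration: the RuleState enum, the "draining" state name, the eligible_caches function and the shape of the advertised data are hypothetical and do not describe the actual astats output.

```python
from enum import Enum


class RuleState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"  # hypothetical: rule still present, about to be removed


def eligible_caches(advertised, healthy, ds):
    """Return the sorted names of caches the router may pick for `ds`.

    `advertised` maps cache name -> {delivery service -> RuleState}, as it
    might be polled from each cache. `healthy` is the set of caches the
    monitor reports as available.
    """
    return sorted(
        cache
        for cache, rules in advertised.items()
        # A cache with no rule, or a draining rule, gets no traffic for `ds`.
        if cache in healthy and rules.get(ds) is RuleState.ACTIVE
    )
```

A cache that has not yet received the rule and a cache that is about to drop it are treated the same way: neither is selected, so no separate step to push the change to the router is needed in this model.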

—Eric




