Posted to dev@trafficcontrol.apache.org by Dave Neuman <ne...@apache.org> on 2021/06/17 22:09:37 UTC

Distributed Traffic Monitor Feedback/Requirements

Hey All,
One of the things we have been talking about doing for a long time is
making Traffic Monitor capable of monitoring a subset of the CDN so that it
can be deployed in a distributed fashion.  The time has come for us to get
moving on this.  We have had some discussions internally to understand what
requirements we have for doing this, but I wanted to solicit feedback from
the community to see if there are potentially other requirements that we
may have missed.  Please take a look at the requirements we have identified
below and let me know what feedback you have.  At this point in time I am
trying to keep this conversation separate from the design conversation and
just focus on the requirements.  Once we all agree on the requirements we
can start discussing the design.  You will notice that this proposal also
includes adding the ability to integrate with external monitoring systems.
I figured now would be a good time to add that functionality in as well.


*Abstract*

Update Traffic Monitor so that it is capable of monitoring only part of the
CDN while still providing a single API for clients to get cache stats,
delivery stats, and cache availability for a whole CDN.  Add the ability to
integrate with other systems that perform additional health monitoring and
consider the status of these systems when making health decisions for a
cache.  Ensure that the Traffic Monitor API is capable of serving thousands
of simultaneous clients, such as all of the caches in a CDN.


*Problem Statement*

Currently, Traffic Monitor can only monitor an entire CDN. This means that
Traffic Monitor has to poll every single cache in a CDN before it can make
cache health decisions and provide statistics. This also means that Traffic
Monitors need to be located in a centralized place from which they can reach
everything, which isn't exactly representative of what a client might see.
While this has worked really well for us to date, we know that at some
point we will run into scaling issues that prevent us from polling caches
faster.  In order to solve our impending scaling issues as well as improve
our ability to make better and faster health decisions, Traffic Monitor
needs to run in a distributed fashion instead of an all-or-nothing
fashion.

Furthermore, there is a growing need to provide support for external
monitoring systems in Traffic Monitor.  Traffic Monitor needs to be able to
use other monitoring systems to aid in the health decision process. While
this could be solved in today's Traffic Monitor, it is best to solve this
problem in conjunction with making the polling distributed.


*Business Justification*

In order to provide the best customer experience possible, we need to have
a robust and timely health monitoring system.  While Traffic Monitor has
been sufficient to date, we need to make sure that we are adapting to meet
the needs of the near future and evolving to continue to meet customers'
needs.  These changes to Traffic Monitor are imperative to providing cache
health data that is as close to real time as possible as the CDN continues
to grow in scale.


*Business Requirements*

   - Traffic Monitor MUST be capable of being configured to monitor a
   portion of a CDN
   - Traffic Monitor MUST be capable of being configured to monitor all
   caches in a CDN
   - Traffic Monitor MUST provide an API to get the health status of ALL
   caches in the CDN
   - Traffic Monitor MUST provide an API to get statistics (from e.g.
   astats data) generated by ALL caches in the CDN. This does not include any
   statistics generated by external monitoring systems.
   - Traffic Monitor MUST log all requests to its API including AT LEAST
   the following information: timestamp, client IP, resource requested,
   response code, response reason, time to serve.
   - Traffic Monitor MUST provide an API to get the status of caches it
   monitors
   - Traffic Monitor MUST log all health state changes for a cache whether
   the decision is made internally or from an external system.
   - Traffic Monitor MUST provide the ability to have more than 1 Traffic
   Monitor monitor the same cache and come to consensus on the health of the
   cache.
   - Traffic Monitor SHOULD provide a way to configure more than one
   subset of caches to monitor - e.g. as a primary and backup.
   - Traffic Monitor SHOULD provide a way to integrate with external
   services to provide additional cache health monitoring
   - Traffic Monitor SHOULD have the capability to provide a non-boolean
   health score for a cache - e.g. a number between 0 and 100
   - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
   generation
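
To make the API logging requirement above a bit more concrete, here is a
minimal sketch of what one log record could look like.  The class name, the
field names, the JSON-lines format, and the example path are all assumptions
for illustration; the requirement only fixes which pieces of information
must be present, not how they are encoded.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ApiAccessLogEntry:
    """One logged API request; field names are illustrative assumptions."""
    timestamp: str           # when the request was received (ISO 8601, UTC)
    client_ip: str           # address of the requesting client
    resource: str            # path that was requested
    response_code: int       # HTTP status code returned
    response_reason: str     # HTTP reason phrase returned
    time_to_serve_ms: float  # how long the request took to serve


def format_log_line(entry: ApiAccessLogEntry) -> str:
    """Serialize an entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


example = ApiAccessLogEntry(
    timestamp="2021-06-17T22:09:37Z",
    client_ip="192.0.2.10",            # documentation-range address
    resource="/example/cache-health",  # hypothetical path, not a real TM route
    response_code=200,
    response_reason="OK",
    time_to_serve_ms=1.8,
)
print(format_log_line(example))
```

One record per line keeps the log easy to grep and to ship to whatever log
pipeline an operator already runs.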

Re: Distributed Traffic Monitor Feedback/Requirements

Posted by Dave Neuman <ne...@apache.org>.

Sounds great, thanks Eric!
I am looking forward to the design discussions.
--Dave

On Fri, Jun 25, 2021 at 9:17 AM Eric Friedrich <fr...@apache.org> wrote:

> I'll do my best to rephrase as a potential requirement :-)
>
> 1) Traffic Monitor MUST ensure all caches are monitored upon failure of any
> TM server(s) or physical location. (i.e. no SPoF of TMs for
> polling/aggregation).
>
> Number of TM failures to be tolerated before we stop polling some caches /
> how we accomplish the above/ maximum number of caches under supervision by
> a TM are all TBD in design phase
>
> --Eric

Re: Distributed Traffic Monitor Feedback/Requirements

Posted by Eric Friedrich <fr...@apache.org>.

I'll do my best to rephrase as a potential requirement :-)

1) Traffic Monitor MUST ensure all caches are monitored upon failure of any
TM server(s) or physical location. (i.e. no SPoF of TMs for
polling/aggregation).

Number of TM failures to be tolerated before we stop polling some caches /
how we accomplish the above/ maximum number of caches under supervision by
a TM are all TBD in design phase
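
As a rough illustration of how the requirement above could be checked, here
is a small sketch that tests whether every cache is still polled after any
N TM servers fail.  It assumes a static mapping of TM name to the caches it
polls; the function name, the mapping, and the host names are hypothetical,
and failure of a whole physical location is not modeled (it could be
approximated by failing all TMs in that location together).

```python
from itertools import combinations
from typing import Dict, Set


def survives_failures(
    assignments: Dict[str, Set[str]], all_caches: Set[str], failures: int
) -> bool:
    """Return True if every cache is still polled after ANY `failures` TMs fail."""
    monitors = list(assignments)
    # Failing more TMs than exist is the same as failing all of them.
    failures = min(failures, len(monitors))
    for failed in combinations(monitors, failures):
        # Union of the caches polled by the TMs that are still up.
        still_polled: Set[str] = set()
        for monitor in monitors:
            if monitor not in failed:
                still_polled |= assignments[monitor]
        if not all_caches <= still_polled:
            return False
    return True


# Hypothetical layout: two TMs share the east caches, one TM covers the west.
assignments = {
    "tm-east-1": {"edge-e1", "edge-e2"},
    "tm-east-2": {"edge-e1", "edge-e2"},
    "tm-west-1": {"edge-w1"},
}
all_caches = {"edge-e1", "edge-e2", "edge-w1"}

print(survives_failures(assignments, all_caches, 0))
print(survives_failures(assignments, all_caches, 1))
```

In this layout the second check fails because losing tm-west-1 alone leaves
edge-w1 unpolled, which is exactly the kind of SPoF the requirement is meant
to rule out.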

--Eric

On Fri, Jun 25, 2021 at 10:36 AM Dave Neuman <ne...@apache.org> wrote:

> Hey Eric,
> Thanks for the questions/feedback.  My responses are inline below.  Most of
> your questions will need to be addressed when we do design as right now I
> just want to make sure we are not missing any requirements.  I hope to
> start design discussions in the next week or two.
>
> Thanks,
> Dave
>
> On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <fr...@apache.org> wrote:
>
> > Some comments and questions jointly compiled
> >
> >   - How is TM configured to monitor a subset of a CDN, is it a static
> > allocation of caches to TMs?
> >
>
> DN:  I think that is to be determined when we start to think about design,
> which is after we agree on the requirements.  I think for our use case the
> simplest way to do this would be by cache group.  A Traffic Monitor
> could be configured to monitor one or more cache groups.  However, if there
> is a better way we could do this, I am all ears.
>
> >
> >   - Can you describe how the primary + backup work. Do they both poll the
> > cache simultaneously
> >
>
> DN: Again, I think we can sort out the details when we talk about design.
> It actually might make more sense to just have multiple TMs monitor a cache
> group and treat them all as "live"; this has the benefit of providing more
> than one view of a cache.
>
>
> >   - If a TM fails, how do the TMs heal / reallocate polling
> > responsibilities. Does another TM pick up the slack?
> >
>
> DN:  You want to dive straight into design :). I think the easiest answer
> here is to ensure multiple TMs are polling each cache and that they are all
> treated as live; then we can just use the optimistic consensus that is
> already built into TM.
>
>
> >
> >   - What prevents a misconfiguration where some caches are not polled by
> > any TM?
> >
>
> DN:  Great question.  I don't think that is one I have considered, but I
> suppose we could add a requirement saying that TM must have a way to
> identify unpolled caches...what do you think?
>
>
> >
> >   - Are there any minimums/maximums to how many TMs will poll a cache?
> >
> DN: Minimum is one; maximum is up to the operator.  I don't know of a limit
> in TM.
>
>
> >
> >   - What is the meaning of non-boolean 0-100 health? How is this computed and
> > how is it used?
> >
>
> DN:  The health score stuff is going to be an entirely different topic
> because I don't think it needs to be conflated with distributed polling.  I
> put that requirement in because I wanted to document that this is something
> we are thinking about so that we don't make it difficult on ourselves when
> we do this refactor.
> Right now a cache's health is boolean: it either gets traffic or it
> doesn't.  The idea behind the health score is that we could assign
> different health scores for caches in a cache group and then TR can use
> that when determining which cache to choose.  For example, if multiple
> caches are getting close to their bandwidth limit, then instead of pulling
> all traffic from them we could simply weight them lower so that TR prefers
> other caches but can still use them if needed. We have a bunch of other
> use cases that are probably best saved for when we are ready to formally
> present the idea.
>
>
> >
> >   - What can we do to further harden TM<->TM communications and reduce
> > blast radius?
> >
>
> DN:  Another topic for the design discussions.  I think the basic idea is to
> not have a SPoF, which means multiple TMs polling each cache and multiple
> TMs available to provide status to TRs, caches, and TSs.
>
>
>
> > Big thumbs up on decoupling TM from Traffic Ops. What does this practically
> > mean - no more monitoring.json? Can we document specifically which APIs TM
> > will use?
> > (Aside, we might want to think about this as an opportunity to move TM into
> > its own repository- assuming the community decides to go ahead with
> > separate repos per component).
> >
>
> DN:  I think that is a stretch goal for now.  TM will still have to get
> its configuration from somewhere, but ideally it does not have to come
> from TO.  Ultimately I would like TO to just serve the basic data from the
> database, and to build services that can generate configs from that data
> using business logic.  We sort of did this with t3c, which gets all of the
> information it needs from TO without relying on the config file APIs
> that used to be in TO (maybe still are?).  However, t3c is purely client
> side and I prefer a more centralized approach with something like a TM
> configuration service that can read from TO and use the data to populate
> APIs for TM to get its config.  That way we could define just the data we
> need in TM and a user could choose to run the TM configuration service
> which talks to TO or provide the required data using a different backend
> system.  I think this is probably a larger conversation we need to have
> when we start talking about how we are going to design the distributed TM.
>
> As for its own repo, that is a larger conversation.  I am not sure what
> that means for all of the ancillary pieces like cdn-in-a-box, the pkg
> script, etc. If it is worth the trouble then I am all for it, but I don't
> think we should let this thread get bogged down with that conversation.

Re: Distributed Traffic Monitor Feedback/Requirements

Posted by Dave Neuman <ne...@apache.org>.

Hey Eric,
Thanks for the questions/feedback.  My responses are inline below.  Most of
your questions will need to be addressed when we do design as right now I
just want to make sure we are not missing any requirements.  I hope to
start design discussions in the next week or two.

Thanks,
Dave

On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <fr...@apache.org> wrote:

> Some comments and questions jointly compiled
>
>   - How is TM configured to monitor a subset of a CDN, is it a static
> allocation of caches to TMs?
>

DN:  I think that is to be determined when we start to think about design,
which is after we agree on the requirements.  I think for our use case the
simplest way to do this would be by cache group.  A Traffic Monitor
could be configured to monitor one or more cache groups.  However, if there
is a better way we could do this, I am all ears.

>
>   - Can you describe how the primary + backup work. Do they both poll the
> cache simultaneously
>

DN: Again, I think we can sort out the details when we talk about design.
It actually might make more sense to just have multiple TMs monitor a cache
group and treat them all as "live"; this has the benefit of providing more
than one view of a cache.

>   - If a TM fails, how do the TMs heal / reallocate polling
> responsibilities. Does another TM pick up the slack?
>

DN:  You want to dive straight into design :). I think the easiest answer
here is to ensure multiple TMs are polling each cache and that they are all
treated as live; then we can just use the optimistic consensus that is
already built into TM.
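
For anyone less familiar with the idea, here is a minimal sketch of what I
mean by optimistic consensus, assuming it boils down to "a cache is
available if at least one TM that polls it says it is available".  The
function name and the shape of the per-TM reports are made up for
illustration and are not TM's actual data structures.

```python
from typing import Dict, Iterable


def optimistic_consensus(reports: Iterable[Dict[str, bool]]) -> Dict[str, bool]:
    """Combine per-TM availability reports; any single 'available' vote wins."""
    combined: Dict[str, bool] = {}
    for report in reports:
        for cache, available in report.items():
            # A cache stays unavailable only if every TM that polled it agrees.
            combined[cache] = combined.get(cache, False) or available
    return combined


# Hypothetical reports from two TMs that poll overlapping sets of caches.
tm_a = {"edge-1": True, "edge-2": False}
tm_b = {"edge-1": False, "edge-2": False, "edge-3": True}

print(optimistic_consensus([tm_a, tm_b]))
```

Because each TM only votes on the caches it actually polls, the same merge
works whether the TMs monitor identical or only partially overlapping
subsets.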

>
>   - What prevents a misconfiguration where some caches are not polled by
> any TM?
>

DN:  Great question.  I don't think that is one I have considered, but I
suppose we could add a requirement saying that TM must have a way to
identify unpolled caches...what do you think?
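
Just to show the kind of check I have in mind, here is a small sketch,
assuming TMs are assigned by cache group as discussed above.  The function
name, the two mappings, and all host and group names are hypothetical; how
TM would actually learn about the assignments is a design question.

```python
from typing import Dict, List, Set


def find_unpolled_caches(
    cache_groups: Dict[str, Set[str]], monitor_groups: Dict[str, Set[str]]
) -> List[str]:
    """Return caches whose cache group is not assigned to any TM."""
    covered_groups: Set[str] = set()
    for groups in monitor_groups.values():
        covered_groups |= groups
    unpolled: Set[str] = set()
    for group, caches in cache_groups.items():
        if group not in covered_groups:
            unpolled |= caches
    return sorted(unpolled)


# Hypothetical CDN: three cache groups, but nobody was assigned "cg-south".
cache_groups = {
    "cg-east": {"edge-e1", "edge-e2"},
    "cg-west": {"edge-w1"},
    "cg-south": {"edge-s1", "edge-s2"},
}
monitor_groups = {
    "tm-1": {"cg-east"},
    "tm-2": {"cg-east", "cg-west"},
}

print(find_unpolled_caches(cache_groups, monitor_groups))
```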

>
>   - Are there any minimums/maximums to how many TMs will poll a cache?
>
DN: Minimum is one; maximum is up to the operator.  I don't know of a limit
in TM.

>
>   - What is the meaning of non-boolean 0-100 health? How is this computed and
> how is it used?
>

DN:  The health score stuff is going to be an entirely different topic
because I don't think it needs to be conflated with distributed polling.  I
put that requirement in because I wanted to document that this is something
we are thinking about so that we don't make it difficult on ourselves when
we do this refactor.
Right now a cache's health is boolean: it either gets traffic or it
doesn't.  The idea behind the health score is that we could assign
different health scores for caches in a cache group and then TR can use
that when determining which cache to choose.  For example, if multiple
caches are getting close to their bandwidth limit, then instead of pulling
all traffic from them we could simply weight them lower so that TR prefers
other caches but can still use them if needed. We have a bunch of other
use cases that are probably best saved for when we are ready to formally
present the idea.
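
Purely as an illustration of the weighting idea (not a proposal for how TR
should do it), here is a sketch that picks a cache with probability
proportional to its health score.  The function name, the scores, and the
use of a simple weighted random choice are all assumptions; TR's real
selection logic is not modeled here.

```python
import random
from collections import Counter
from typing import Dict, Optional


def pick_cache(scores: Dict[str, int], rng: random.Random) -> Optional[str]:
    """Pick a cache at random, weighted by its 0-100 health score."""
    # A score of 0 behaves like today's "unavailable": never selected.
    eligible = {cache: score for cache, score in scores.items() if score > 0}
    if not eligible:
        return None
    names = sorted(eligible)
    weights = [eligible[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical cache group: edge-2 is near its bandwidth limit, edge-3 is down.
scores = {"edge-1": 100, "edge-2": 25, "edge-3": 0}
rng = random.Random(42)
counts = Counter(pick_cache(scores, rng) for _ in range(10000))
print(counts)
```

With scores like these, edge-2 still takes some traffic, just far less than
edge-1, instead of being pulled from rotation entirely.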

>
>   - What can we do to further harden TM<->TM communications and reduce
> blast radius?
>

DN:  Another topic for the design discussions.  I think the basic idea is to
not have a SPoF, which means multiple TMs polling each cache and multiple
TMs available to provide status to TRs, caches, and TSs.

> Big thumbs up on decoupling TM from Traffic Ops. What does this practically
> mean - no more monitoring.json? Can we document specifically which APIs TM
> will use?
> (Aside, we might want to think about this as an opportunity to move TM into
> its own repository- assuming the community decides to go ahead with
> separate repos per component).
>

DN:  I think that is a stretch goal for now.  TM will still have to get
its configuration from somewhere, but ideally it does not have to come
from TO.  Ultimately I would like TO to just serve the basic data from the
database and build services that can be used to generate configs using
business logic.  We sort of did this with t3c where it gets all of the
information it needs from TO without relying on config file APIs
that used to be in TO (maybe still are?).  However, t3c is purely client
side and I prefer a more centralized approach with something like a TM
configuration service that can read from TO and use the data to populate
APIs for TM to get its config.  That way we could define just the data we
need in TM and a user could choose to run the TM configuration service
which talks to TO or provide the required data using a different backend
system.  I think this is probably a larger conversation we need to have
when we start talking about how we are going to design the distributed TM.
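
A minimal sketch of the "define just the data TM needs and let the backend
vary" idea follows. Everything here is hypothetical (the MonitorConfig
fields, the ConfigSource interface, and StaticConfigSource); a
configuration service that reads from TO would simply be another
implementation of the same interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class MonitorConfig:
    """The data a TM needs in order to start polling (fields are assumed)."""
    caches: List[str]
    poll_interval_ms: int


class ConfigSource(Protocol):
    """Any backend that can hand a TM its configuration."""

    def load(self, tm_name: str) -> MonitorConfig:
        ...


class StaticConfigSource:
    """A backend that serves configuration from an in-memory mapping."""

    def __init__(self, configs: Dict[str, MonitorConfig]) -> None:
        self._configs = configs

    def load(self, tm_name: str) -> MonitorConfig:
        return self._configs[tm_name]


def caches_to_poll(source: ConfigSource, tm_name: str) -> List[str]:
    """TM-side code depends only on the interface, not on the backend."""
    return sorted(source.load(tm_name).caches)


source = StaticConfigSource({
    "tm-east": MonitorConfig(caches=["edge-02", "edge-01"],
                             poll_interval_ms=5000),
})
print(caches_to_poll(source, "tm-east"))
```

The point of the interface is that TM never needs to know whether its
config came from TO or from some other system.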

As for its own repo, that is a larger conversation.  I am not sure what
that means for all of the ancillary pieces like cdn-in-a-box, the pkg
script, etc. If it is worth the trouble then I am all for it, but I don't
think we should let this thread get bogged down with that conversation.


Re: Distributed Traffic Monitor Feedback/Requirements

Posted by Eric Friedrich <fr...@apache.org>.

Some comments and questions jointly compiled

  - How is TM configured to monitor a subset of a CDN? Is it a static
allocation of caches to TMs?

  - Can you describe how the primary + backup work? Do they both poll the
cache simultaneously?

  - If a TM fails, how do the TMs heal / reallocate polling
responsibilities. Does another TM pick up the slack?

  - What prevents a misconfiguration where some caches are not polled by
any TM?

  - Are there any minimums/maximums to how many TMs will poll a cache?

  - What is the meaning of non-boolean 0-100 health? How is this computed
and how is it used?

  - What can we do to further harden TM<->TM communications and reduce
blast radius?

Big thumbs up on decoupling TM from Traffic Ops. What does this practically
mean - no more monitoring.json? Can we document specifically which APIs TM
will use?
(Aside, we might want to think about this as an opportunity to move TM into
its own repository- assuming the community decides to go ahead with
separate repos per component).


