
Posted to dev@trafficcontrol.apache.org by Robert O Butts <ro...@apache.org> on 2020/04/09 22:57:44 UTC

ORT Rewrite Proposal

I've made a Blueprint proposing to rewrite ORT:
https://github.com/apache/trafficcontrol/pull/4628

If you have opinions on ORT, please read and provide feedback.

In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
Philosophy" of small, "do one thing" apps.

Importantly, the proposal **removes** the following ORT features:

chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
default Profile runlevel is wrong and broken. But my knowledge of
CentOS, SystemD, chkconfig, and runlevels isn't perfect; if I'm mistaken about
this and you're using ORT to set chkconfig, please let us know ASAP.

ntpd - ORT today has code to set ntpd config and restart the ntpd service.
I have no idea why it was ever in charge of this, but this clearly seems to
be the system's job, not ORT or TC's.

interactive mode - I asked around, and couldn't find anyone using this.
Does anyone use it? And feel it's essential to keep in ORT? And also feel
that the way this proposal breaks up the app so that it's easy to request
and compare files before applying them isn't sufficient?

reval mode - This was put in because ORT was slow. ORT in master now takes
10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
faster than just applying everything. Does anyone feel otherwise?

report mode - The functionality here is valuable. But the intention here is to
replace "ORT report mode" with a pipelined set of app calls or a script to
do the same thing. I.e. because it's "UNIX-Style" you can just "ort-to-get
| ort-make-configs | ort-diff".

package installation - This is the biggest feature the proposal removes,
and probably the most controversial. The thought is: this isn't something
ORT or Traffic Control should be doing. The same thing that manages the
physical machine and/or operating system -- whether that's Ansible, Puppet,
Chef, or a human System Administrator -- should be installing the OS
packages for ATS and its plugins, just like it manages all the other
packages on your system. ORT and TC should deploy configuration, not
install things.
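The report-mode replacement above (`ort-to-get | ort-make-configs | ort-diff`) implies a small final stage. Here is a minimal Go sketch of what that stage might look like. It assumes, hypothetically, that earlier stages pass it the files on disk and the freshly generated files as name-to-content maps. `reportChanges` and that data shape are illustrative only and do not come from the Blueprint.

```go
package main

import (
	"fmt"
	"sort"
)

// reportChanges is a hypothetical ort-diff-style stage. It compares the
// config files currently on disk with freshly generated ones and returns a
// sorted list describing which files applying them would add or change.
func reportChanges(onDisk, generated map[string]string) []string {
	var report []string
	for name, newText := range generated {
		oldText, exists := onDisk[name]
		switch {
		case !exists:
			report = append(report, "add "+name)
		case oldText != newText:
			report = append(report, "change "+name)
		}
	}
	sort.Strings(report) // map iteration order is random; sort for stable output
	return report
}

func main() {
	// Placeholder file contents, for illustration only.
	onDisk := map[string]string{
		"remap.config":   "map http://a.example/ http://origin-a.example/\n",
		"records.config": "CONFIG proxy.config.http.server_ports STRING 80\n",
	}
	generated := map[string]string{
		"remap.config":   "map http://a.example/ http://origin-b.example/\n",
		"records.config": "CONFIG proxy.config.http.server_ports STRING 80\n",
		"parent.config":  "dest_domain=. parent=\"p1.example:80\"\n",
	}
	for _, line := range reportChanges(onDisk, generated) {
		fmt.Println(line)
	}
	// Prints:
	// add parent.config
	// change remap.config
}
```

The stage only reads its inputs and returns text, so an operator could point it at any two sets of files, not just the ones ORT produced.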

So yeah, feedback welcome. Feel free to post it on the list here or the
Blueprint PR on GitHub.

Thanks,

Re: ORT Rewrite Proposal

Posted by Robert O Butts <ro...@apache.org>.

>`yum install -y $(ort-pkg --upgrades)`

I like that idea. Not convinced that's exactly the best way to do it. E.g.
maybe outputting a file after a config run would be better? Or something.

But I'm definitely +1 on putting some thought into it.

My biggest concern is that it still requires package info to be in TO,
when getting away from that is part of the goal. And specifically, it means
having your package data in 2 places, TO and your server manager
(Ansible/Puppet/etc). It would be ideal to avoid that. I'm afraid the only
way might be something like a database of ATS features/plugins to versions;
the complexity of that feels like it's getting out of hand.


On Fri, Apr 10, 2020 at 3:38 PM Chris Lemmons <al...@gmail.com> wrote:

> I am +1 on most of this, though I have specific comments on
> implementation details I'll add in a PR review where I can tag the
> conversation to specific lines.
>
> But I do think that cache-specific packages might still make sense to
> be controlled by a package manager. ORT is responsible for producing
> config files for very specific versions of cache software and plugins;
> it makes sense to me that it also be able to manage the versions of
> that software and plugins. Likewise, cache-software upgrades have to
> happen in concert with other config upgrades. Since they need to be
> synchronous, it makes sense to have a single tool managing them. If
> we're going to suggest that other system packaging tools handle cache
> upgrades, we should at least think carefully about how we're going to
> let systems know when it's safe to perform those upgrades.
>
> In the context of the Unix philosophy, I would expect to run a command
> like `yum install -y $(ort-pkg --upgrades)` or `ort-pkg --upgrades |
> xargs yum install -y` or maybe even something like `ort-pkg --upgrades
> --yum-script | sh`. But ORT is the tool in the best position to tell
> me what cache-related packages I need to make sure are installed. If a
> given server is assigned a delivery service that requires a special
> plugin, it would be really nice if ORT could ensure that the plugin
> was present and of an adequate version. Otherwise, you have two
> separate systems trying to keep the same information in sync.
>
> Another advantage of this approach is that ORT doesn't need elevated
> privileges to install packages. And it's easy to swap in alternate OS
> options: `emerge $(ort-pkg --upgrades)` or `ort-pkg --upgrades
> --portage-script | sh` for example.

Re: ORT Rewrite Proposal

Posted by Chris Lemmons <al...@gmail.com>.

I am +1 on most of this, though I have specific comments on
implementation details I'll add in a PR review where I can tag the
conversation to specific lines.

But I do think that cache-specific packages might still make sense to
be controlled by a package manager. ORT is responsible for producing
config files for very specific versions of cache software and plugins;
it makes sense to me that it also be able to manage the versions of
that software and plugins. Likewise, cache-software upgrades have to
happen in concert with other config upgrades. Since they need to be
synchronous, it makes sense to have a single tool managing them. If
we're going to suggest that other system packaging tools handle cache
upgrades, we should at least think carefully about how we're going to
let systems know when it's safe to perform those upgrades.

In the context of the Unix philosophy, I would expect to run a command
like `yum install -y $(ort-pkg --upgrades)` or `ort-pkg --upgrades |
xargs yum install -y` or maybe even something like `ort-pkg --upgrades
--yum-script | sh`. But ORT is the tool in the best position to tell
me what cache-related packages I need to make sure are installed. If a
given server is assigned a delivery service that requires a special
plugin, it would be really nice if ORT could ensure that the plugin
was present and of an adequate version. Otherwise, you have two
separate systems trying to keep the same information in sync.

Another advantage of this approach is that ORT doesn't need elevated
privileges to install packages. And it's easy to swap in alternate OS
options: `emerge $(ort-pkg --upgrades)` or `ort-pkg --upgrades
--portage-script | sh` for example.
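Here is a rough Go sketch of the core of the `ort-pkg --upgrades` idea. It assumes the required package versions would come from TO and the installed ones from the local package database. Both are replaced here by hypothetical literal maps, and the plugin name and all version numbers are made up. The sketch only checks for an exact version match; real RPM version ordering is more involved.

```go
package main

import (
	"fmt"
	"sort"
)

// packagesToInstall returns "name-version" strings for every required package
// that is missing or installed at a different version. It only checks for an
// exact match; it does not implement RPM version ordering.
func packagesToInstall(required, installed map[string]string) []string {
	var out []string
	for name, want := range required {
		if have, ok := installed[name]; !ok || have != want {
			out = append(out, name+"-"+want)
		}
	}
	sort.Strings(out) // deterministic output for scripts and tests
	return out
}

func main() {
	// Hypothetical inputs: a real tool would read these from TO and from
	// the local package database, respectively.
	required := map[string]string{
		"trafficserver":  "8.0.5",
		"example-plugin": "1.2.0", // hypothetical plugin package
	}
	installed := map[string]string{
		"trafficserver": "8.0.3",
	}
	// One package per line, so `ort-pkg --upgrades | xargs yum install -y`
	// style usage would work.
	for _, pkg := range packagesToInstall(required, installed) {
		fmt.Println(pkg)
	}
}
```

Because the tool only prints, it needs no elevated privileges. The caller decides whether to hand the list to yum, emerge, or something else.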

On Thu, Apr 9, 2020 at 5:46 PM Jeremy Mitchell <mi...@gmail.com> wrote:
>
> I have a feeling with the introduction of flexible topologies (in addition
> to or in place of our currently static topologies) -
> https://github.com/apache/trafficcontrol/pull/4537 - we might need to
> rethink how content invalidations (revalidations) work anyhow. Besides
> that, I'm +1 on an ORT rewrite if not for anything more than the ~2700
> lines of perl -
> https://github.com/apache/trafficcontrol/blob/master/traffic_ops/ort/traffic_ops_ort.pl
> -
> are almost unmaintainable/inextensible.
>
> Jeremy

Re: ORT Rewrite Proposal

Posted by Jeremy Mitchell <mi...@gmail.com>.

I have a feeling with the introduction of flexible topologies (in addition
to or in place of our currently static topologies) -
https://github.com/apache/trafficcontrol/pull/4537 - we might need to
rethink how content invalidations (revalidations) work anyhow. Besides
that, I'm +1 on an ORT rewrite, if only because the ~2700 lines of Perl
(https://github.com/apache/trafficcontrol/blob/master/traffic_ops/ort/traffic_ops_ort.pl)
are almost unmaintainable and inextensible.

Jeremy

On Thu, Apr 9, 2020 at 5:16 PM Robert O Butts <ro...@apache.org> wrote:

> >If you remove reval then everything goes back to tiered updates again.
>
> Good point. I think we can make this smart enough to apply what it can
> (everything but reval?) without waiting for parents. Definitely worth
> putting on the to-do list.

Re: ORT Rewrite Proposal

Posted by Robert O Butts <ro...@apache.org>.

>If you remove reval then everything goes back to tiered updates again.

Good point. I think we can make this smart enough to apply what it can
(everything but reval?) without waiting for parents. Definitely worth
putting on the to-do list.


On Thu, Apr 9, 2020 at 5:06 PM Derek Gelinas <mr...@gmail.com> wrote:

> I’m +1 on all of these. Unsure about package control, though. I suspect
> that there are still some people using that one out in the world.
>
> I’ve never once heard of interactive mode being used, I like report mode
> but I’m fine as long as there’s an alternative. Agree about reval mode.  It
> was made to solve a very specific scaling issue.  That said, it also opened
> up caches to update configs free of having to wait for parents, so there
> could be some value to be had, there.  If you remove reval then everything
> goes back to tiered updates again.
>
> Derek

Re: ORT Rewrite Proposal

Posted by Derek Gelinas <mr...@gmail.com>.

I’m +1 on all of these. Unsure about package control, though. I suspect that there are still some people using that one out in the world.

I’ve never once heard of interactive mode being used. I like report mode, but I’m fine as long as there’s an alternative. Agree about reval mode. It was made to solve a very specific scaling issue. That said, it also freed caches to update configs without having to wait for parents, so there could be some value to be had there. If you remove reval, then everything goes back to tiered updates again.

Derek

> On Apr 9, 2020, at 6:57 PM, Robert O Butts <ro...@apache.org> wrote:
> 
> I've made a Blueprint proposing to rewrite ORT:
> https://github.com/apache/trafficcontrol/pull/4628
> 
> If you have opinions on ORT, please read and provide feedback.
> 
> In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
> Philosophy" of small, "do one thing" apps.
> 
> Importantly, the proposal **removes** the following ORT features:
> 
> chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
> default Profile runlevel is wrong and broken. But my knowledge of
> CentOS,SystemD,chkconfig,runlevels isn't perfect, if I'm mistaken about
> this and you're using ORT to set chkconfig, please let us know ASAP.
> 
> ntpd - ORT today has code to set ntpd config and restart the ntpd service.
> I have no idea why it was ever in charge of this, but this clearly seems to
> be the system's job, not ORT or TC's.
> 
> interactive mode - I asked around, and couldn't find anyone using this.
> Does anyone use it? And feel it's essential to keep in ORT? And also feel
> that the way this proposal breaks up the app so that it's easy to request
> and compare files before applying them isn't sufficient?
> 
> reval mode - This was put in because ORT was slow. ORT in master now takes
> 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
> faster than just applying everything. Does anyone feel otherwise?
> 
> report mode - The functionality here is valuable. But intention here is to
> replace "ORT report mode" with a pipelined set of app calls or a script to
> do the same thing. I.e. because it's "UNIX-Style" you can just "ort-to-get
> | ort-make-configs | ort-diff".
> 
> package installation - This is the biggest feature the proposal removes,
> and probably the most controversial. The thought is: this isn't something
> ORT or Traffic Control should be doing. The same thing that manages the
> physical machine and/or operating system -- whether that's Ansible, Puppet,
> Chef, or a human System Administrator -- should be installing the OS
> packages for ATS and its plugins, just like it manages all the other
> packages on your system. ORT and TC should deploy configuration, not
> install things.
> 
> So yeah, feedback welcome. Feel free to post it on the list here or the
> blueprint PR on github.
> 
> Thanks,

Re: ORT Rewrite Proposal

Posted by Robert O Butts <ro...@apache.org>.

>Communicating between 11 different processes via stdin/stdout and exit
>codes, even if the processes themselves are relatively simple, is fairly
>complex as a whole.

>I don't really see a problem with implementing it as a single
>well-designed binary

It's not that there's a "problem." The question is, if we're doing a
rewrite, which design has more pros and fewer cons?

Personally, I think the UNIX-Style has more pros. Small, self-contained
apps are easier to develop, easier to test, easier to compose, more
powerful, and more flexible.

Self-contained executables for each logical "thing" make each of those
things much easier to Integration Test on its own, including behavior that
would otherwise only get Unit Tests, which are more artificial and test less
of the complete system. At the same time, Integration Tests of the full
pipeline aren't any harder. The Integration Test Framework can call the
Aggregator or pipeline, just like it would the monolith, and verify the output.

>stdin/stdout

The input and output can be tested just like functions would be, in a
monolith. In fact, functions can be very difficult to test if they have
side effects, which are very easy and natural to write in most
OO/Imperative languages. UNIX-style apps make that impossible; or at least
very difficult. The more natural way of writing apps is making them just
return output, and not change random things on the system, guiding
developers toward more "pure" functions/apps. I can't prove that, of
course; but I really do think small Unix-Style apps guide developers toward
fewer side effects, making testing and verifying correctness much easier. I
think that's another big advantage of the Unix Philosophy.
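Here is the testing argument made concrete. A stage written as a pure function can be verified by comparing its output with the expected output, which is what a test does when it feeds an app stdin and reads stdout. This is a minimal Go sketch, and `makeRemapConfig` is a hypothetical stage that emits ATS `remap.config` `map` lines from a host-to-origin map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// makeRemapConfig is a hypothetical pure stage: it reads nothing from disk or
// the network and changes nothing on the system; it only turns input data
// into file text, so it can be tested by comparing strings.
func makeRemapConfig(hostToOrigin map[string]string) string {
	hosts := make([]string, 0, len(hostToOrigin))
	for h := range hostToOrigin {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts) // map order is random; sort so output is deterministic
	var b strings.Builder
	for _, h := range hosts {
		fmt.Fprintf(&b, "map http://%s/ http://%s/\n", h, hostToOrigin[h])
	}
	return b.String()
}

func main() {
	got := makeRemapConfig(map[string]string{
		"cdn.example.com": "origin.example.com",
	})
	want := "map http://cdn.example.com/ http://origin.example.com/\n"
	fmt.Println(got == want) // true
}
```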

>exit codes

Exit codes should always be 0 unless there's an error. The tool
(Aggregator) or operator calling the pipeline should return the error if
any component fails (a script can do this with "set -o pipefail"). That
isn't any more difficult to use than a monolith that checks "if err != nil"
for every step.

In fact, it's arguably easier than the monolith, because you don't have to
do the manual error checks for every call; errors are automatically propagated
up. If there's an error in a pipeline, the failing stage's error text is
returned for the whole pipeline, and in a test the user is given the exact
failure text to go track down.

UNIX-style executables are easier to develop, too. They can be written in
different languages, if one language is more optimal for a particular task.
And they can be more easily worked on separately by different developers,
with fewer conflicts. It's clearer what each "one thing" is, whereas in a
single large app, it's very easy to blur lines and make confusing code
around individual things or concepts. And more developers can work on ORT
concurrently without conflict, because each person only has to modify
their own app's input and output, while everyone else works on the other
apps in parallel.

They're also easier to compose. Operators can call and pipe whatever they
need. With a Monolith, only what's exposed is available. For example,
ORT today _could_ expose a diff argument. But it doesn't. Right now,
operators have no way to run ORT's custom diff on their own input. Or,
suppose we didn't expose any kind of "caching," but some TC operator needed
to cache for a minute, to reduce load. With the Monolith, it's just
impossible, unless devs add it, or the operator dives in and modifies the
large and complex ORT codebase. But with UNIX-Style apps, you can write a
quick shell script to call the ort-get-to-data app, save it to a file, if
the file's age is short enough don't call again, and then call the rest of
the pipeline. There are innumerable things like that which are quick and
easy with small apps, and difficult or impossible with a monolith.

>I would also like to bring up the idea that we really need to change
>ORT's "pull" paradigm, or at least make the "pull" more efficient so
>that we don't have thousands of ORT instances all making the same
>requests to TO, with TO having to hit the DB for every request even
>though nothing has actually changed. Since we control ORT we have
>nearly 100% of control over all TO API requests made, yet we have a
>design that self-DDOSes itself by default right now. Do we want to
>tackle that problem as part of this redesign, or is that out of scope?

I would vote out-of-scope, that we discuss that in a different thread as a
different project.

Personally, IMO the advantages of "pull," the HTTP Client-Server model, far
outweigh those of "push." The entire internet is built on Client-Server, for
good reason. Push has tons of issues, like being Stateful, needing to "register"
clients, needing "brokers," and having more points of failure. The ORT/TO
scalability problem exists because we don't implement well-known standards for
updates, namely If-Modified-Since et al (I hope the irony isn't lost on anyone). IMO
the solution is to implement IMS on Traffic Ops, and then make the "Traffic
Ops Requestor" do proper IMS requests. Both network and database calls for
IMS are tiny, and it solves the issue without all the disadvantages and
costs of "push."

But I think that project is orthogonal and independent of this.

On Mon, Apr 13, 2020 at 7:06 PM ocket 8888 <oc...@gmail.com> wrote:

> For what it's worth, I'd be +1 on re-examining "push" vs "pull" for ORT.

Re: ORT Rewrite Proposal

Posted by Chris Lemmons <al...@gmail.com>.

> the idea that we really need to change ORT's "pull" paradigm

> Rawlin never even mentioned the word "Push" :)

Noted. I'll open a ticket for the Quantum Tunneled Config. :P

But I think you're spot on when you identify that the issue is one of
scaling and not specifically the architecture. I'm very interested to
see your thoughts on scaling solutions.

> Config Generator and Config File Preprocessor should be 1 thing that takes in TO data and spits out config files.

Agree. I noted this in the PR review.

> Server config readiness and ATS plugin readiness can just be a "system readiness verifier"

These two components are likely to be very small, so I don't know that
it matters all that much. But they also probably won't share much, if
any, logic, and I don't see much benefit to combining them. I think
the question is: "Is there a reasonable use case where an operator or
other component wants to know the answer to one of those questions
(server config ready or ATS plugin ready) but not the other?" If not,
combining seems fine.

> The restart determiner and service reloader can probably be one thing that takes flags, maybe a "report" mode.

Yeah, modeling this after systemctl and related tools seems
reasonable. `tcctl status`, `tcctl config reload`, `tcctl config
check`, and similar ideas, perhaps?
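A rough Go sketch of that systemctl-style shape, purely to make the idea concrete. `tcctl` and its subcommands are hypothetical names from this thread, and the handlers are stubs that only describe what they would do.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// commands maps hypothetical tcctl subcommands to handlers. These stubs only
// describe their job; real ones would query ATS or apply configuration.
var commands = map[string]func() string{
	"status":        func() string { return "would report service and config state" },
	"config reload": func() string { return "would reload ATS configuration" },
	"config check":  func() string { return "would validate generated config files" },
}

// dispatch runs the handler for the longest known subcommand at the start of
// args, so trailing arguments such as flags don't prevent a match (this
// sketch simply ignores them).
func dispatch(args []string) (string, error) {
	for n := len(args); n > 0; n-- {
		if h, ok := commands[strings.Join(args[:n], " ")]; ok {
			return h(), nil
		}
	}
	return "", errors.New("unknown command: " + strings.Join(args, " "))
}

func main() {
	out, err := dispatch(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println(out)
}
```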

> I also think the components should be packaged together

100% agree. Managing various versions would create all sorts of fun
operational bugs for folks. And I think people would keep them all in
sync in practice anyway.

On Wed, Apr 15, 2020 at 3:57 PM Dave Neuman <ne...@apache.org> wrote:
>
> Rawlin never even mentioned the word "Push" :) He was also referring to the
> potential of thousands of clients requesting many of the same endpoints all
> at once. Our CDN is good at that; Traffic Ops is not. Anyway, I think
> that problem can and should be solved outside the scope of the ORT rewrite
> (for now).
>
> I agree that having smaller components that do specific things is generally
> a good thing, but I also think there are diminishing returns, and having too
> many causes more problems than it is worth. I also think the components
> should be packaged together so that we aren't trying to manage 11 (or
> whatever) different RPMs.
>
> There are some proposed executables that I think we can consolidate:
> - Config Generator and Config File Preprocessor should be 1 thing that
> takes in TO data and spits out config files.
> - Server config readiness and ATS plugin readiness can just be a "system
> readiness verifier"
> - The restart determiner and service reloader can probably be one thing
> that takes flags, maybe a "report" mode.
>
>
> Thanks,
> Dave

Re: ORT Rewrite Proposal

Posted by Dave Neuman <ne...@apache.org>.

Rawlin never even mentioned the word "Push" :) He was also referring to the
potential of thousands of clients requesting many of the same endpoints all
at once. Our CDN is good at that; Traffic Ops is not. Anyway, I think
that problem can and should be solved outside the scope of the ORT rewrite
(for now).

I agree that having smaller components that do specific things is generally
a good thing, but I also think there are diminishing returns, and having too
many causes more problems than it is worth. I also think the components
should be packaged together so that we aren't trying to manage 11 (or
whatever) different RPMs.

There are some proposed executables that I think we can consolidate:
- Config Generator and Config File Preprocessor should be 1 thing that
takes in TO data and spits out config files.
- Server config readiness and ATS plugin readiness can just be a "system
readiness verifier"
- The restart determiner and service reloader can probably be one thing
that takes flags, maybe a "report" mode.
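To make the "one thing that takes flags" idea concrete, a minimal Go sketch of a combined restart determiner and reloader with a `-report` flag. The `determine` function, the flag name, and especially the file-to-action mapping are assumptions for illustration; which ATS config changes actually need a restart rather than a reload is not settled here.

```go
package main

import (
	"flag"
	"fmt"
)

// Action is what a combined restart-determiner/reloader decides to do.
type Action int

const (
	None Action = iota
	Reload
	Restart
)

func (a Action) String() string {
	return [...]string{"none", "reload", "restart"}[a]
}

// determine returns the strongest action any changed file requires: a file
// in restartFiles forces a restart, any other change only needs a reload.
func determine(changed []string, restartFiles map[string]bool) Action {
	act := None
	for _, f := range changed {
		if restartFiles[f] {
			return Restart
		}
		act = Reload
	}
	return act
}

func main() {
	report := flag.Bool("report", false, "only print the decision; do not act")
	flag.Parse()

	// Assumed mapping for illustration only; which files truly need a full
	// restart should come from the ATS documentation, not this list.
	restartFiles := map[string]bool{"plugin.config": true}

	act := determine(flag.Args(), restartFiles)
	if *report || act == None {
		fmt.Println(act)
		return
	}
	// A real tool would call the service manager here; the sketch only says so.
	fmt.Println("would perform:", act)
}
```

Because report mode prints the output of the same `determine` call that would drive the real action, it can't drift from what the tool would actually do.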


Thanks,
Dave


On Mon, Apr 13, 2020 at 8:32 PM Chris Lemmons <al...@gmail.com> wrote:

> > Communicating between 11 different processes via stdin/stdout and exit
> codes, even if the processes themselves are relatively simple, is fairly
> complex as a whole.
>
> Yes and no. Using multiple processes doesn't actually reduce the
> complexity of the task as a whole. What it does do is make it crystal
> clear to a maintainer or operator what information a given process has
> and what it produces.
>
> Sure, there are other ways to do pure functional components, but
> everybody knows those components never actually wind up purely
> functional. It's very often impractical. Separate process spaces help
> slice the problem into manageable pieces so that it's easy to
> determine how any given component operates.
>
> Additionally, we've had to rewrite and replace this beast once and
> we're basically having to do it all at once. If we break it into
> smaller components, we allow components to be tested in isolation, we
> free ourselves to replace only specific subsets in the future, and by
> using standard streams for communication, we create a baseline we can
> leverage immediately as a testing hook.
>
> The major disadvantage of putting these tasks in separate process
> spaces arises if there are large chunks of data that multiple components
> need to share. That's one of the primary things I reviewed the blueprint
> for, and large data structures aren't being repeatedly serialized and
> deserialized. Most of the interfaces are fairly small.
>
> > I would also like to bring up the idea that we really need to change
> ORT's "pull" paradigm, or at least make the "pull" more efficient so that
> we don't have thousands of ORT instances all making the same requests to TO
>
> Changing the push/pull of the model doesn't help with efficiency. For
> that, you need an indication of what config diffs need to be applied.
> And that particular question isn't any easier to answer in either
> model; it's the same inputs and outputs.
>
> I'm definitely +1 on making this a deployment decision, though. With
> the model proposed in the Blueprint, it's merely a question of when
> and how the aggregator is invoked. The ability to invoke from cron,
> pdsh, Ansible, Puppet, a shell, or via some future tool we add to the
> TO UI would be very helpful. Besides, a frequent pull model (aided by
> the aforementioned differential data transfer we have to develop in
> any case) is usually as fast as push and much more reliable. I fully
> expect reasonably fast polls against TC at some point when it can
> return 304s on 99.9% of requests.
>
> Push models also have different security implications that have to be
> handled. It's doable, especially if TO supports two-way authenticated
> TLS (it currently doesn't). But pull uses a security model people
> already have their heads around. If I were deploying a push system,
> I'd have to think very carefully about how all the various systems
> were authenticated.
>
> And by codifying the roles with clear inputs and outputs, we give CDN
> operators the flexibility to modify those streams if necessary.
>
> Some arbitrary ideas of varying value spring to mind: An operator may
> have an in-house secrets management system that could serve as input
> for the systems that require secrets. Or have a sanitizer that turns
> production data into test data via specialized cleaning routines,
> which could be invoked between the TO data and the config generator.
> Or a testing system that mocks inputs and compares outputs to a
> standard library of answers to validate regressions quickly and
> efficiently. Or any of a variety of things that aren't mentioned here.
> Clean inputs and outputs make those specific variations much easier to
> manage, test, and use. And as our input and output formats change,
> it'll be much more feasible for operators to manage those systems.
>
> Lastly, Gmail is telling me that other people have replied since I
> started this message. Hopefully, my points are still relevant. :)

Re: ORT Rewrite Proposal

Posted by Chris Lemmons <al...@gmail.com>.

> Communicating between 11 different processes via stdin/stdout and exit codes, even if the processes themselves are relatively simple, is fairly complex as a whole.

Yes and no. Using multiple processes doesn't actually reduce the
complexity of the task as a whole. What it does do is make it crystal
clear to a maintainer or operator what information a given process has
and what it produces.

Sure, there are other ways to do pure functional components, but
everybody knows those components never actually wind up purely
functional. It's very often impractical. Separate process spaces help
slice the problem into manageable pieces so that it's easy to
determine how any given component operates.

Additionally, we've had to rewrite and replace this beast once and
we're basically having to do it all at once. If we break it into
smaller components, we allow components to be tested in isolation, we
free ourselves to replace only specific subsets in the future, and by
using standard streams for communication, we create a baseline we can
leverage immediately as a testing hook.

The major disadvantage of putting these tasks in separate process
spaces arises if there are large chunks of data that multiple components
need to share. That's one of the primary things I reviewed the blueprint
for, and large data structures aren't being repeatedly serialized and
deserialized. Most of the interfaces are fairly small.
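As a sketch of what "standard streams as a testing hook" can look like in Go: the stage's entire logic sits in a function over `io.Reader`/`io.Writer`, and `main` only wires it to stdin, stdout, and the exit code. The `toData` input shape is a hypothetical, heavily trimmed stand-in for real Traffic Ops data, and the output is simplified remap-style lines, not a real config generator.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// toData is a hypothetical, heavily trimmed stand-in for the Traffic Ops
// data a config-generator stage might read on stdin.
type toData struct {
	Remaps []struct {
		From string `json:"from"`
		To   string `json:"to"`
	} `json:"remaps"`
}

// run is the whole stage: decode JSON from in, write config lines to out.
// It never touches os.Stdin/os.Stdout, so tests can drive it in memory.
func run(in io.Reader, out io.Writer) error {
	var d toData
	if err := json.NewDecoder(in).Decode(&d); err != nil {
		return fmt.Errorf("decoding input: %w", err)
	}
	for _, r := range d.Remaps {
		if _, err := fmt.Fprintf(out, "map %s %s\n", r.From, r.To); err != nil {
			return err
		}
	}
	return nil
}

// runString is a convenience wrapper for tests and quick experiments.
func runString(input string) (string, error) {
	var sb strings.Builder
	err := run(strings.NewReader(input), &sb)
	return sb.String(), err
}

func main() {
	if err := run(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // a non-zero exit code tells the caller this stage failed
	}
}
```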

> I would also like to bring up the idea that we really need to change ORT's "pull" paradigm, or at least make the "pull" more efficient so that we don't have thousands of ORT instances all making the same requests to TO

Changing the push/pull of the model doesn't help with efficiency. For
that, you need an indication of what config diffs need to be applied.
And that particular question isn't any easier to answer in either
model; it's the same inputs and outputs.
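A minimal Go sketch of that point: deciding which config files need to change is a pure function of the generated files and the deployed ones, regardless of which side initiates the transfer. `changedFiles` is a hypothetical helper, and the file contents are placeholders (`proxy.config.example` is not a real ATS setting).

```go
package main

import (
	"fmt"
	"sort"
)

// changedFiles returns, in sorted order, the names of generated config files
// whose content differs from what is deployed, including files not deployed
// at all. Only these need to be written. Removals are out of scope here.
func changedFiles(generated, deployed map[string]string) []string {
	var out []string
	for name, content := range generated {
		if cur, ok := deployed[name]; !ok || cur != content {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	generated := map[string]string{
		"remap.config":   "map http://a.example/ http://origin.example/\n",
		"records.config": "CONFIG proxy.config.example INT 1\n",
	}
	deployed := map[string]string{
		"remap.config":   "map http://a.example/ http://origin.example/\n",
		"records.config": "CONFIG proxy.config.example INT 0\n",
	}
	fmt.Println(changedFiles(generated, deployed)) // [records.config]
}
```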

I'm definitely +1 on making this a deployment decision, though. With
the model proposed in the Blueprint, it's merely a question of when
and how the aggregator is invoked. The ability to invoke from cron,
pdsh, Ansible, Puppet, a shell, or via some future tool we add to the
TO UI would be very helpful. Besides, a frequent pull model (aided by
the aforementioned differential data transfer we have to develop in
any case) is usually as fast as push and much more reliable. I fully
expect reasonably fast polls against TC at some point when it can
return 304s on 99.9% of requests.

Push models also have different security implications that have to be
handled. It's doable, especially if TO supports two-way authenticated
TLS (it currently doesn't). But pull uses a security model people
already have their heads around. If I were deploying a push system,
I'd have to think very carefully about how all the various systems
were authenticated.

And by codifying the roles with clear inputs and outputs, we give CDN
operators the flexibility to modify those streams if necessary.

Some arbitrary ideas of varying value spring to mind: An operator may
have an in-house secrets management system that could serve as input
for the systems that require secrets. Or have a sanitizer that turns
production data into test data via specialized cleaning routines,
which could be invoked between the TO data and the config generator.
Or a testing system that mocks inputs and compares outputs to a
standard library of answers to validate regressions quickly and
efficiently. Or any of a variety of things that aren't mentioned here.
Clean inputs and outputs make those specific variations much easier to
manage, test, and use. And as our input and output formats change,
it'll be much more feasible for operators to manage those systems.
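Taking the sanitizer idea as an example, a minimal Go sketch of such a filter: it reads arbitrary JSON on stdin, replaces the values of a configurable set of keys, and writes the result to stdout. The key names in `sensitiveKeys` are assumptions; an operator would substitute whatever their data actually contains.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// sensitiveKeys is a hypothetical list; an operator would supply their own.
var sensitiveKeys = map[string]bool{"password": true, "privateKey": true}

// sanitize walks decoded JSON in place and replaces the values of
// sensitive keys at any depth, leaving everything else untouched.
func sanitize(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			if sensitiveKeys[k] {
				t[k] = "REDACTED"
			} else {
				t[k] = sanitize(child)
			}
		}
		return t
	case []interface{}:
		for i, child := range t {
			t[i] = sanitize(child)
		}
		return t
	default:
		return v
	}
}

func main() {
	var data interface{}
	if err := json.NewDecoder(os.Stdin).Decode(&data); err != nil {
		fmt.Fprintln(os.Stderr, "decoding input:", err)
		os.Exit(1)
	}
	if err := json.NewEncoder(os.Stdout).Encode(sanitize(data)); err != nil {
		fmt.Fprintln(os.Stderr, "encoding output:", err)
		os.Exit(1)
	}
}
```

It could then sit in a pipeline such as `ort-to-get | sanitize | ort-make-configs`, where `sanitize` is the hypothetical name of this filter.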

Lastly, Gmail is telling me that other people have replied since I
started this message. Hopefully, my points are still relevant. :)

On Mon, Apr 13, 2020 at 7:06 PM ocket 8888 <oc...@gmail.com> wrote:
>
> For what it's worth, I'd be +1 on re-examining "push" vs "pull" for ORT.
>
> On Mon, Apr 13, 2020, 16:46 Rawlin Peters <ra...@apache.org> wrote:
>
> > I'm generally +1 on redesigning ORT with the removal of the features
> > you mentioned, but the one thing that worries me is the number of
> > unique binaries/executables involved (potentially 11). Communicating
> > between 11 different processes via stdin/stdout and exit codes, even
> > if the processes themselves are relatively simple, is fairly complex
> > as a whole. IMO I don't really see a problem with implementing it as a
> > single well-designed binary -- if it's Go, each proposed binary could
> > just be its own package instead, with each package only exporting one
> > high-level function. The main func would then be the "Aggregator" that
> > simply calls each package's public function in turn, passing the
> > output of one into the input of the next, checking for errors at each
> > step. I think that would make it much easier to debug and test as a
> > whole.
> >
> > I would also like to bring up the idea that we really need to change
> > ORT's "pull" paradigm, or at least make the "pull" more efficient so
> > that we don't have thousands of ORT instances all making the same
> > requests to TO, with TO having to hit the DB for every request even
> > though nothing has actually changed. Since we control ORT we have
> > nearly 100% of control over all TO API requests made, yet we have a
> > design that self-DDOSes itself by default right now. Do we want to
> > tackle that problem as part of this redesign, or is that out of scope?
> >
> > - Rawlin
> >
> > On Thu, Apr 9, 2020 at 4:57 PM Robert O Butts <ro...@apache.org> wrote:
> > >
> > > I've made a Blueprint proposing to rewrite ORT:
> > > https://github.com/apache/trafficcontrol/pull/4628
> > >
> > > If you have opinions on ORT, please read and provide feedback.
> > >
> > > In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
> > > Philosophy" of small, "do one thing" apps.
> > >
> > > Importantly, the proposal **removes** the following ORT features:
> > >
> > > chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
> > > default Profile runlevel is wrong and broken. But my knowledge of
> > > CentOS,SystemD,chkconfig,runlevels isn't perfect, if I'm mistaken about
> > > this and you're using ORT to set chkconfig, please let us know ASAP.
> > >
> > > ntpd - ORT today has code to set ntpd config and restart the ntpd
> > service.
> > > I have no idea why it was ever in charge of this, but this clearly seems
> > to
> > > be the system's job, not ORT or TC's.
> > >
> > > interactive mode - I asked around, and couldn't find anyone using this.
> > > Does anyone use it? And feel it's essential to keep in ORT? And also feel
> > > that the way this proposal breaks up the app so that it's easy to request
> > > and compare files before applying them isn't sufficient?
> > >
> > > reval mode - This was put in because ORT was slow. ORT in master now takes
> > > 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
> > > faster than just applying everything. Does anyone feel otherwise?
> > >
> > > report mode - The functionality here is valuable. But intention here is to
> > > replace "ORT report mode" with a pipelined set of app calls or a script to
> > > do the same thing. I.e. because it's "UNIX-Style" you can just "ort-to-get
> > > | ort-make-configs | ort-diff".
> > >
> > > package installation - This is the biggest feature the proposal removes,
> > > and probably the most controversial. The thought is: this isn't something
> > > ORT or Traffic Control should be doing. The same thing that manages the
> > > physical machine and/or operating system -- whether that's Ansible, Puppet,
> > > Chef, or a human System Administrator -- should be installing the OS
> > > packages for ATS and its plugins, just like it manages all the other
> > > packages on your system. ORT and TC should deploy configuration, not
> > > install things.
> > >
> > > So yeah, feedback welcome. Feel free to post it on the list here or the
> > > blueprint PR on github.
> > >
> > > Thanks,
> >

Re: ORT Rewrite Proposal

Posted by ocket 8888 <oc...@gmail.com>.

For what it's worth, I'd be +1 on re-examining "push" vs "pull" for ORT.

On Mon, Apr 13, 2020, 16:46 Rawlin Peters <ra...@apache.org> wrote:

> I'm generally +1 on redesigning ORT with the removal of the features
> you mentioned, but the one thing that worries me is the number of
> unique binaries/executables involved (potentially 11). Communicating
> between 11 different processes via stdin/stdout and exit codes, even
> if the processes themselves are relatively simple, is fairly complex
> as a whole. IMO I don't really see a problem with implementing it as a
> single well-designed binary -- if it's Go, each proposed binary could
> just be its own package instead, with each package only exporting one
> high-level function. The main func would then be the "Aggregator" that
> simply calls each package's public function in turn, passing the
> output of one into the input of the next, checking for errors at each
> step. I think that would make it much easier to debug and test as a
> whole.
>
> I would also like to bring up the idea that we really need to change
> ORT's "pull" paradigm, or at least make the "pull" more efficient so
> that we don't have thousands of ORT instances all making the same
> requests to TO, with TO having to hit the DB for every request even
> though nothing has actually changed. Since we control ORT we have
> nearly 100% of control over all TO API requests made, yet we have a
> design that self-DDOSes itself by default right now. Do we want to
> tackle that problem as part of this redesign, or is that out of scope?
>
> - Rawlin
>
> On Thu, Apr 9, 2020 at 4:57 PM Robert O Butts <ro...@apache.org> wrote:
> >
> > I've made a Blueprint proposing to rewrite ORT:
> > https://github.com/apache/trafficcontrol/pull/4628
> >
> > If you have opinions on ORT, please read and provide feedback.
> >
> > In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
> > Philosophy" of small, "do one thing" apps.
> >
> > Importantly, the proposal **removes** the following ORT features:
> >
> > chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
> > default Profile runlevel is wrong and broken. But my knowledge of
> > CentOS,SystemD,chkconfig,runlevels isn't perfect, if I'm mistaken about
> > this and you're using ORT to set chkconfig, please let us know ASAP.
> >
> > ntpd - ORT today has code to set ntpd config and restart the ntpd service.
> > I have no idea why it was ever in charge of this, but this clearly seems to
> > be the system's job, not ORT or TC's.
> >
> > interactive mode - I asked around, and couldn't find anyone using this.
> > Does anyone use it? And feel it's essential to keep in ORT? And also feel
> > that the way this proposal breaks up the app so that it's easy to request
> > and compare files before applying them isn't sufficient?
> >
> > reval mode - This was put in because ORT was slow. ORT in master now takes
> > 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
> > faster than just applying everything. Does anyone feel otherwise?
> >
> > report mode - The functionality here is valuable. But intention here is to
> > replace "ORT report mode" with a pipelined set of app calls or a script to
> > do the same thing. I.e. because it's "UNIX-Style" you can just "ort-to-get
> > | ort-make-configs | ort-diff".
> >
> > package installation - This is the biggest feature the proposal removes,
> > and probably the most controversial. The thought is: this isn't something
> > ORT or Traffic Control should be doing. The same thing that manages the
> > physical machine and/or operating system -- whether that's Ansible, Puppet,
> > Chef, or a human System Administrator -- should be installing the OS
> > packages for ATS and its plugins, just like it manages all the other
> > packages on your system. ORT and TC should deploy configuration, not
> > install things.
> >
> > So yeah, feedback welcome. Feel free to post it on the list here or the
> > blueprint PR on github.
> >
> > Thanks,
>

Re: [EXTERNAL] Re: ORT Rewrite Proposal

Posted by "Gray, Jonathan" <Jo...@comcast.com>.

If all the binaries are compiled and shipped together all the time, I could go either way.  The main gain I can see, though, would be in debugging and test scope with small binaries.  It's easier to say affirmatively that change A's scope of effect is limited to one testable object of 4 than it is to wonder if the monolith has some other dependent codepaths that have to be checked as well.  Having one binary to rule them all, I think, fosters a human habit of scope creep: just make it do one more thing, instead of focusing on a specific set of jobs.  I'm a huge fan of adding more responsibility on the system operators to use their native toolsets to facilitate several of the jobs ORT has traditionally done.  That helps the project lower its overall maintenance obligation and provides greater flexibility, so it's easier to break into new environment configurations.

I'm also not a fan (-1) of push instead of pull.  It trades the DDoS problem you mention for having to manage all the orchestration surrounding when things apply, and for a whole new set of error cases where a push message gets missed in the network somewhere.  Even if you think a message bus of some kind makes it better, that just adds another layer of complexity and another fault domain to the overall solution.  A fast-enough poll is also indistinguishable from push.  Instead, I think it's more worthwhile to look at how to "take the mass out of the hammer".  We're making significant strides to reduce our most expensive queries now, and that's only going to get better with flexible cachegroups.  HTTP caching could get us a very long way toward making ORT take a smaller resource hit or making TP more responsive.  If the database queries are still too much, we could look at splitting read queries off onto a separate connection string for multiple RO replicas.

Jonathan G

On 4/13/20, 4:46 PM, "Rawlin Peters" <ra...@apache.org> wrote:

    I'm generally +1 on redesigning ORT with the removal of the features
    you mentioned, but the one thing that worries me is the number of
    unique binaries/executables involved (potentially 11). Communicating
    between 11 different processes via stdin/stdout and exit codes, even
    if the processes themselves are relatively simple, is fairly complex
    as a whole. IMO I don't really see a problem with implementing it as a
    single well-designed binary -- if it's Go, each proposed binary could
    just be its own package instead, with each package only exporting one
    high-level function. The main func would then be the "Aggregator" that
    simply calls each package's public function in turn, passing the
    output of one into the input of the next, checking for errors at each
    step. I think that would make it much easier to debug and test as a
    whole.

    I would also like to bring up the idea that we really need to change
    ORT's "pull" paradigm, or at least make the "pull" more efficient so
    that we don't have thousands of ORT instances all making the same
    requests to TO, with TO having to hit the DB for every request even
    though nothing has actually changed. Since we control ORT we have
    nearly 100% of control over all TO API requests made, yet we have a
    design that self-DDOSes itself by default right now. Do we want to
    tackle that problem as part of this redesign, or is that out of scope?

    - Rawlin

    On Thu, Apr 9, 2020 at 4:57 PM Robert O Butts <ro...@apache.org> wrote:
    >
    > I've made a Blueprint proposing to rewrite ORT:
    > https://github.com/apache/trafficcontrol/pull/4628
    >
    > If you have opinions on ORT, please read and provide feedback.
    >
    > In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
    > Philosophy" of small, "do one thing" apps.
    >
    > Importantly, the proposal **removes** the following ORT features:
    >
    > chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
    > default Profile runlevel is wrong and broken. But my knowledge of
    > CentOS,SystemD,chkconfig,runlevels isn't perfect, if I'm mistaken about
    > this and you're using ORT to set chkconfig, please let us know ASAP.
    >
    > ntpd - ORT today has code to set ntpd config and restart the ntpd service.
    > I have no idea why it was ever in charge of this, but this clearly seems to
    > be the system's job, not ORT or TC's.
    >
    > interactive mode - I asked around, and couldn't find anyone using this.
    > Does anyone use it? And feel it's essential to keep in ORT? And also feel
    > that the way this proposal breaks up the app so that it's easy to request
    > and compare files before applying them isn't sufficient?
    >
    > reval mode - This was put in because ORT was slow. ORT in master now takes
    > 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
    > faster than just applying everything. Does anyone feel otherwise?
    >
    > report mode - The functionality here is valuable. But intention here is to
    > replace "ORT report mode" with a pipelined set of app calls or a script to
    > do the same thing. I.e. because it's "UNIX-Style" you can just "ort-to-get
    > | ort-make-configs | ort-diff".
    >
    > package installation - This is the biggest feature the proposal removes,
    > and probably the most controversial. The thought is: this isn't something
    > ORT or Traffic Control should be doing. The same thing that manages the
    > physical machine and/or operating system -- whether that's Ansible, Puppet,
    > Chef, or a human System Administrator -- should be installing the OS
    > packages for ATS and its plugins, just like it manages all the other
    > packages on your system. ORT and TC should deploy configuration, not
    > install things.
    >
    > So yeah, feedback welcome. Feel free to post it on the list here or the
    > blueprint PR on github.
    >
    > Thanks,

Re: ORT Rewrite Proposal

Posted by Rawlin Peters <ra...@apache.org>.

I'm generally +1 on redesigning ORT with the removal of the features
you mentioned, but the one thing that worries me is the number of
unique binaries/executables involved (potentially 11). Communicating
between 11 different processes via stdin/stdout and exit codes, even
if the processes themselves are relatively simple, is fairly complex
as a whole. IMO I don't really see a problem with implementing it as a
single well-designed binary -- if it's Go, each proposed binary could
just be its own package instead, with each package only exporting one
high-level function. The main func would then be the "Aggregator" that
simply calls each package's public function in turn, passing the
output of one into the input of the next, checking for errors at each
step. I think that would make it much easier to debug and test as a
whole.
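A rough sketch of that single-binary layout in Go. The stage functions are hypothetical stand-ins for the proposed ort-to-get, ort-make-configs and ort-diff tools, and the types are invented for illustration; in a real tree each stage would live in its own package exporting one function:

```go
package main

import (
	"errors"
	"fmt"
)

// TOData is a hypothetical stand-in for what the ort-to-get stage returns.
type TOData struct {
	ServerName string
	Files      map[string]string // config file name -> generated contents
}

// ConfigFile is a hypothetical generated config file.
type ConfigFile struct {
	Name, Text string
}

// getTOData stands in for ort-to-get (a real one would call the TO API).
func getTOData(server string) (TOData, error) {
	if server == "" {
		return TOData{}, errors.New("no server name given")
	}
	return TOData{ServerName: server, Files: map[string]string{"remap.config": "placeholder"}}, nil
}

// makeConfigs stands in for ort-make-configs.
func makeConfigs(d TOData) ([]ConfigFile, error) {
	files := make([]ConfigFile, 0, len(d.Files))
	for name, text := range d.Files {
		files = append(files, ConfigFile{Name: name, Text: text})
	}
	return files, nil
}

// diffConfigs stands in for ort-diff: it returns the names of generated
// files whose text differs from the on-disk copy.
func diffConfigs(files []ConfigFile, onDisk map[string]string) []string {
	changed := []string{}
	for _, f := range files {
		if onDisk[f.Name] != f.Text {
			changed = append(changed, f.Name)
		}
	}
	return changed
}

// run is the "Aggregator": it calls each stage in turn, feeds one stage's
// output into the next, and stops at the first error, tagged with the stage.
func run(server string, onDisk map[string]string) ([]string, error) {
	data, err := getTOData(server)
	if err != nil {
		return nil, fmt.Errorf("ort-to-get stage: %w", err)
	}
	files, err := makeConfigs(data)
	if err != nil {
		return nil, fmt.Errorf("ort-make-configs stage: %w", err)
	}
	return diffConfigs(files, onDisk), nil
}

func main() {
	changed, err := run("edge-01", map[string]string{})
	fmt.Println(changed, err) // [remap.config] <nil>
}
```

Each stage stays unit-testable on its own, and a failure surfaces as an ordinary wrapped Go error naming the stage rather than as an exit code to interpret.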

I would also like to bring up the idea that we really need to change
ORT's "pull" paradigm, or at least make the "pull" more efficient so
that we don't have thousands of ORT instances all making the same
requests to TO, with TO having to hit the DB for every request even
though nothing has actually changed. Since we control ORT we have
nearly 100% of control over all TO API requests made, yet we have a
design that self-DDOSes itself by default right now. Do we want to
tackle that problem as part of this redesign, or is that out of scope?

- Rawlin

On Thu, Apr 9, 2020 at 4:57 PM Robert O Butts <ro...@apache.org> wrote:
>
> I've made a Blueprint proposing to rewrite ORT:
> https://github.com/apache/trafficcontrol/pull/4628
>
> If you have opinions on ORT, please read and provide feedback.
>
> In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
> Philosophy" of small, "do one thing" apps.
>
> Importantly, the proposal **removes** the following ORT features:
>
> chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
> default Profile runlevel is wrong and broken. But my knowledge of
> CentOS,SystemD,chkconfig,runlevels isn't perfect, if I'm mistaken about
> this and you're using ORT to set chkconfig, please let us know ASAP.
>
> ntpd - ORT today has code to set ntpd config and restart the ntpd service.
> I have no idea why it was ever in charge of this, but this clearly seems to
> be the system's job, not ORT or TC's.
>
> interactive mode - I asked around, and couldn't find anyone using this.
> Does anyone use it? And feel it's essential to keep in ORT? And also feel
> that the way this proposal breaks up the app so that it's easy to request
> and compare files before applying them isn't sufficient?
>
> reval mode - This was put in because ORT was slow. ORT in master now takes
> 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
> faster than just applying everything. Does anyone feel otherwise?
>
> report mode - The functionality here is valuable. But intention here is to
> replace "ORT report mode" with a pipelined set of app calls or a script to
> do the same thing. I.e. because it's "UNIX-Style" you can just "ort-to-get
> | ort-make-configs | ort-diff".
>
> package installation - This is the biggest feature the proposal removes,
> and probably the most controversial. The thought is: this isn't something
> ORT or Traffic Control should be doing. The same thing that manages the
> physical machine and/or operating system -- whether that's Ansible, Puppet,
> Chef, or a human System Administrator -- should be installing the OS
> packages for ATS and its plugins, just like it manages all the other
> packages on your system. ORT and TC should deploy configuration, not
> install things.
>
> So yeah, feedback welcome. Feel free to post it on the list here or the
> blueprint PR on github.
>
> Thanks,
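For comparison, the pipeline in the proposal ("ort-to-get | ort-make-configs | ort-diff") depends on each tool acting as a stdin/stdout filter that reports failure through its exit code. A minimal sketch of one such stage in Go, assuming a hypothetical JSON list of {name, text} objects as the format passed between stages and an in-memory map standing in for the files on disk:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// ConfigFile is a hypothetical message format passed between stages.
type ConfigFile struct {
	Name string `json:"name"`
	Text string `json:"text"`
}

// filterChanged decodes a JSON list of generated files, keeps only those
// whose text differs from onDisk, and re-encodes the result as JSON.
func filterChanged(input []byte, onDisk map[string]string) ([]byte, error) {
	var files []ConfigFile
	if err := json.Unmarshal(input, &files); err != nil {
		return nil, fmt.Errorf("decoding input: %w", err)
	}
	changed := []ConfigFile{}
	for _, f := range files {
		if onDisk[f.Name] != f.Text {
			changed = append(changed, f)
		}
	}
	return json.Marshal(changed)
}

func main() {
	input, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A real stage would read the current files from disk; an empty map
	// here treats every incoming file as changed.
	out, err := filterChanged(input, map[string]string{})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // non-zero exit lets the calling script detect the failure
	}
	fmt.Println(string(out))
}
```

In a bash script, "set -o pipefail" is needed for a failure in an earlier stage to show up in the pipeline's exit status, which is one concrete piece of the stdin/stdout/exit-code coordination Rawlin mentions.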