You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficcontrol.apache.org by GitBox <gi...@apache.org> on 2022/01/28 18:00:00 UTC

[GitHub] [trafficcontrol] rawlinp opened a new pull request #6527: Traffic Monitor Distributed Polling

rawlinp opened a new pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527


   Rather than having a single TM poll an entire CDN, the new
   `distributed_polling` config option allows a TM to poll a subset of the
   CDN and share its data to other TMs outside of its TM group
   (cachegroup). This separates TM peers into two different categories:
   
   - local peers
   - distributed peers
   
   A CDN is divided into N subsets, where N is the number of TM groups
   (cachegroups). Each TM in a TM group polls their assigned cachegroups,
   combining state with the other TMs in its group (its local peers), and
   polls TMs from every other TM group (its distributed peers) in order to
   get health availability for the entire CDN. Therefore, each TM is only
   responsible for polling a subset of the CDN yet still has a full view of
   the CDN's health. This means TR and the tc-health-client can still poll
   TM's /publish/CrStates endpoint to get availability for the entire CDN.
   
   <!--
   Thank you for contributing! Please be sure to read our contribution guidelines: https://github.com/apache/trafficcontrol/blob/master/CONTRIBUTING.md
   If this closes or relates to an existing issue, please reference it using one of the following:
   
   Closes: #ISSUE
   Related: #ISSUE
   
   If this PR fixes a security vulnerability, DO NOT submit! Instead, contact
   the Apache Traffic Control Security Team at security@trafficcontrol.apache.org and follow the
   guidelines at https://apache.org/security regarding vulnerability disclosure.
   -->
   
   
   <!-- **^ Add meaningful description above** --><hr/>
   
   ## Which Traffic Control components are affected by this PR?
   <!-- Please delete all components from this list that are NOT affected by this PR.
   Feel free to add the name of a tool or script that is affected but not on the list.
   -->
   - Documentation
   - Traffic Monitor
   
   ## What is the best way to verify this PR?
   <!-- Please include here ALL the steps necessary to test your PR.
   If your PR has tests (and most should), provide the steps needed to run the tests.
   If not, please provide step-by-step instructions to test the PR manually and explain why your PR does not need tests. -->
   Firstly, run Traffic Monitor without enabling `distributed_polling`. TM should continue to function normally in a non-distributed manner. Secondly, enable `distributed_polling` and ensure TM functions properly. In order to do this, you may need at least 4 TMs divided into two separate cachegroups (2 TMs per group) in Traffic Ops.
   
   Note: for local testing, I configured 9 TM servers in three different groups using IP addresses 127.0.0.2 - 127.0.0.10. On MacOS, you can use this command to configure these loopback addresses:
   ```
   for i in {2..10}; do sudo ifconfig lo0 alias 127.0.0.$i up; done
   ```
   Since TM uses hostnames instead of IP addresses for peer polling, I tried adding the TM FQDNs to `/etc/hosts`, but for some reason it seems Go on MacOS doesn't check that file (you may have better luck doing this on Linux). So instead I set up some A records in a lab DNS zone to return 127.0.0.2 - 127.0.0.10 so that TM makes real DNS requests which return these loopback addresses.
   
   ## PR submission checklist
   - [x] This PR has tests <!-- If not, please delete this text and explain why this PR does not need tests. -->
   - [x] This PR has documentation <!-- If not, please delete this text and explain why this PR does not need documentation. -->
   - [x] This PR has a CHANGELOG.md entry <!-- A fix for a bug from an ATC release, an improvement, or a new feature should have a changelog entry. -->
   - [x] This PR **DOES NOT FIX A SERIOUS SECURITY VULNERABILITY** (see [the Apache Software Foundation's security guidelines](https://apache.org/security) for details)
   
   <!--
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804078427



##########
File path: docs/source/development/traffic_monitor/traffic_monitor_api.rst
##########
@@ -402,6 +402,35 @@ Response Structure
 TODO
 
 
+``/publish/DistributedPeerStates``
+==================================
+The health state information from all distributed peer Traffic Monitors.
+
+``GET``
+-------
+:Response Type: ?
+
+Request Structure
+"""""""""""""""""
+.. table:: Request Query Parameters
+
+	+--------------+---------+------------------------------------------------+
+	|  Parameter   | Type    |                  Description                   |
+	+==============+=========+================================================+
+	| ``hc``       | integer | The history count, number of items to display. |
+	+--------------+---------+------------------------------------------------+
+	| ``stats``    | string  | A comma separated list of stats to display.    |
+	+--------------+---------+------------------------------------------------+
+	| ``wildcard`` | boolean | Controls whether specified stats should be     |
+	|              |         | treated as partial strings.                    |
+	+--------------+---------+------------------------------------------------+
+
+Response Structure
+""""""""""""""""""
+
+TODO

Review comment:
       Yes, I think you're referring to https://github.com/apache/trafficcontrol/issues/3351




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#issuecomment-1024807255


   Marking this `WIP` until I can figure out why the TM integration tests are failing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] TaylorCFrey commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
TaylorCFrey commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r803359752



##########
File path: docs/source/development/traffic_monitor/traffic_monitor_api.rst
##########
@@ -402,6 +402,35 @@ Response Structure
 TODO
 
 
+``/publish/DistributedPeerStates``
+==================================
+The health state information from all distributed peer Traffic Monitors.
+
+``GET``
+-------
+:Response Type: ?
+
+Request Structure
+"""""""""""""""""
+.. table:: Request Query Parameters
+
+	+--------------+---------+------------------------------------------------+
+	|  Parameter   | Type    |                  Description                   |
+	+==============+=========+================================================+
+	| ``hc``       | integer | The history count, number of items to display. |
+	+--------------+---------+------------------------------------------------+
+	| ``stats``    | string  | A comma separated list of stats to display.    |
+	+--------------+---------+------------------------------------------------+
+	| ``wildcard`` | boolean | Controls whether specified stats should be     |
+	|              |         | treated as partial strings.                    |
+	+--------------+---------+------------------------------------------------+
+
+Response Structure
+""""""""""""""""""
+
+TODO

Review comment:
       I believe there are issues filed to add more information to the TODOs for the TM APIs, is that correct?

##########
File path: lib/go-tc/traffic_monitor.go
##########
@@ -438,7 +438,7 @@ type TrafficMonitor struct {
 	Profile string `json:"profile"`
 	// Location is the Name of the Cache Group to which the Traffic Monitor
 	// belongs - called "Location" for legacy reasons.
-	Location string `json:"location"`
+	Location string `json:"cachegroup"`

Review comment:
       Since we are cleaning up the JSON object, can we remove the explanation comment above as well?

##########
File path: traffic_monitor/config/config.go
##########
@@ -237,6 +235,9 @@ func (c *Config) UnmarshalJSON(data []byte) error {
 	if aux.TrafficOpsMaxRetryIntervalMs != nil {
 		c.TrafficOpsMaxRetryInterval = time.Duration(*aux.TrafficOpsMaxRetryIntervalMs) * time.Millisecond
 	}
+	if c.StatPolling && c.DistributedPolling {
+		return errors.New("invalid configuration: stat_polling cannot be enabled if distributed_polling is also enabled")

Review comment:
       If they are both `false` is that state something we need to account for?

##########
File path: traffic_monitor/datareq/crstate.go
##########
@@ -51,15 +58,36 @@ func srvTRState(params url.Values, localStates peer.CRStatesThreadsafe, combined
 		}
 	}
 
-	data, err := srvTRStateDerived(combinedStates, peerStates)
+	data, err := srvTRStateDerived(combinedStates, local && distributedPollingEnabled)
 
 	return data, http.StatusOK, err
 }
 
-func srvTRStateDerived(combinedStates peer.CRStatesThreadsafe, peerStates peer.CRStatesPeersThreadsafe) ([]byte, error) {
-	return tc.CRStatesMarshall(combinedStates.Get())
+func srvTRStateDerived(combinedStates peer.CRStatesThreadsafe, directlyPolledOnly bool) ([]byte, error) {
+	if !directlyPolledOnly {

Review comment:
       For this `directlyPolledOnly` check, is the desire here to not _have_ to filter if we already know it's not distributed? It seems like we could filter each time, but perhaps we are concerned about performance?
   
   Could we filter the `combinedStates` once when we initially receive the results to avoid passing a boolean flag into the function?

##########
File path: traffic_monitor/peer/crstates.go
##########
@@ -36,7 +36,7 @@ type CRStatesThreadsafe struct {
 
 // NewCRStatesThreadsafe creates a new CRStatesThreadsafe object safe for multiple goroutine readers and a single writer.
 func NewCRStatesThreadsafe() CRStatesThreadsafe {
-	crs := tc.NewCRStates()
+	crs := tc.NewCRStates(8, 8)

Review comment:
       Could we label these values so as to avoid "magic numbers"?

##########
File path: traffic_monitor/cache/data.go
##########
@@ -25,11 +25,6 @@ import (
 	"github.com/apache/trafficcontrol/lib/go-tc"
 )
 
-// AvailableStatusReported is the status string returned by caches set to

Review comment:
       Is this simply removed and not actually "put somewhere more generic"?

##########
File path: traffic_monitor/manager/monitorconfig.go
##########
@@ -337,6 +379,103 @@ func monitorConfigListen(
 	}
 }
 
+// getCacheGroupsToPoll returns the name of this Traffic Monitor's cache group
+// and the set of cache groups it needs to poll.
+func getCacheGroupsToPoll(distributedPolling bool, hostname string, monitors map[string]tc.TrafficMonitor,
+	caches map[string]tc.TrafficServer, allCacheGroups map[string]tc.TMCacheGroup) (string, map[string]tc.TMCacheGroup, error) {
+	tmGroupSet := make(map[string]tc.TMCacheGroup)
+	cacheGroupSet := make(map[string]tc.TMCacheGroup)
+	tmGroupToPolledCacheGroups := make(map[string]map[string]tc.TMCacheGroup)
+	thisTMGroup := ""
+
+	for _, tm := range monitors {
+		if tm.HostName == hostname {
+			thisTMGroup = tm.Location
+		}
+		if tc.CacheStatusFromString(tm.ServerStatus) == tc.CacheStatusOnline {
+			tmGroupSet[tm.Location] = allCacheGroups[tm.Location]
+			if _, ok := tmGroupToPolledCacheGroups[tm.Location]; !ok {
+				tmGroupToPolledCacheGroups[tm.Location] = make(map[string]tc.TMCacheGroup)
+			}
+		}
+	}
+	if thisTMGroup == "" {
+		return "", nil, fmt.Errorf("unable to find cache group for this Traffic Monitor (%s) in monitoring config snapshot", hostname)
+	}
+
+	for _, c := range caches {
+		status := tc.CacheStatusFromString(c.ServerStatus)
+		if status == tc.CacheStatusOnline || status == tc.CacheStatusReported || status == tc.CacheStatusAdminDown {
+			cacheGroupSet[c.CacheGroup] = allCacheGroups[c.CacheGroup]
+		}
+	}
+
+	if !distributedPolling {
+		return thisTMGroup, cacheGroupSet, nil
+	}
+
+	tmGroups := make([]string, 0, len(tmGroupSet))
+	for tg := range tmGroupSet {
+		tmGroups = append(tmGroups, tg)
+	}
+	sort.Strings(tmGroups)
+	cgs := make([]string, 0, len(cacheGroupSet))
+	for cg := range cacheGroupSet {
+		cgs = append(cgs, cg)
+	}
+	sort.Strings(cgs)

Review comment:
       Is it necessary to sort the tmGroups and cacheGroupSets?

##########
File path: traffic_monitor/manager/monitorconfig.go
##########
@@ -337,6 +379,103 @@ func monitorConfigListen(
 	}
 }
 
+// getCacheGroupsToPoll returns the name of this Traffic Monitor's cache group
+// and the set of cache groups it needs to poll.
+func getCacheGroupsToPoll(distributedPolling bool, hostname string, monitors map[string]tc.TrafficMonitor,
+	caches map[string]tc.TrafficServer, allCacheGroups map[string]tc.TMCacheGroup) (string, map[string]tc.TMCacheGroup, error) {
+	tmGroupSet := make(map[string]tc.TMCacheGroup)
+	cacheGroupSet := make(map[string]tc.TMCacheGroup)
+	tmGroupToPolledCacheGroups := make(map[string]map[string]tc.TMCacheGroup)
+	thisTMGroup := ""
+
+	for _, tm := range monitors {
+		if tm.HostName == hostname {
+			thisTMGroup = tm.Location
+		}
+		if tc.CacheStatusFromString(tm.ServerStatus) == tc.CacheStatusOnline {
+			tmGroupSet[tm.Location] = allCacheGroups[tm.Location]
+			if _, ok := tmGroupToPolledCacheGroups[tm.Location]; !ok {
+				tmGroupToPolledCacheGroups[tm.Location] = make(map[string]tc.TMCacheGroup)
+			}
+		}
+	}
+	if thisTMGroup == "" {
+		return "", nil, fmt.Errorf("unable to find cache group for this Traffic Monitor (%s) in monitoring config snapshot", hostname)
+	}
+
+	for _, c := range caches {
+		status := tc.CacheStatusFromString(c.ServerStatus)
+		if status == tc.CacheStatusOnline || status == tc.CacheStatusReported || status == tc.CacheStatusAdminDown {
+			cacheGroupSet[c.CacheGroup] = allCacheGroups[c.CacheGroup]
+		}
+	}
+
+	if !distributedPolling {
+		return thisTMGroup, cacheGroupSet, nil
+	}
+
+	tmGroups := make([]string, 0, len(tmGroupSet))
+	for tg := range tmGroupSet {
+		tmGroups = append(tmGroups, tg)
+	}
+	sort.Strings(tmGroups)
+	cgs := make([]string, 0, len(cacheGroupSet))
+	for cg := range cacheGroupSet {
+		cgs = append(cgs, cg)
+	}
+	sort.Strings(cgs)
+	tmGroupCount := len(tmGroups)
+	var closest string
+	for tmi := 0; len(cgs) > 0; tmi = (tmi + 1) % tmGroupCount {
+		tmGroup := tmGroups[tmi]
+		closest, cgs = findAndRemoveClosestCachegroup(cgs, allCacheGroups[tmGroup], allCacheGroups)
+		tmGroupToPolledCacheGroups[tmGroup][closest] = allCacheGroups[closest]
+	}
+	return thisTMGroup, tmGroupToPolledCacheGroups[thisTMGroup], nil
+}
+
+func findAndRemoveClosestCachegroup(remainingCacheGroups []string, target tc.TMCacheGroup, allCacheGroups map[string]tc.TMCacheGroup) (string, []string) {
+	shortestDistance := math.MaxFloat64
+	shortestIndex := -1
+	for i := 0; i < len(remainingCacheGroups); i++ {
+		distance := getDistance(target, allCacheGroups[remainingCacheGroups[i]])
+		if distance < shortestDistance {
+			shortestDistance = distance
+			shortestIndex = i
+		}
+	}
+	closest := remainingCacheGroups[shortestIndex]
+	remainingCacheGroups = append(remainingCacheGroups[:shortestIndex], remainingCacheGroups[shortestIndex+1:]...)
+	return closest, remainingCacheGroups
+}
+
+const meanEarthRadius = 6371.0
+const x = math.Pi / 180
+
+// toRadians converts degrees to radians.
+func toRadians(d float64) float64 {
+	return d * x
+}
+
+// getDistance gets the great circle distance in kilometers between x and y.
+func getDistance(x, y tc.TMCacheGroup) float64 {

Review comment:
       Whoa, nice. I'll asume this is correct ;P Is this Haversine?

##########
File path: traffic_monitor/conf/traffic_monitor.cfg
##########
@@ -4,6 +4,8 @@
 	"peer_optimistic": true,
 	"peer_optimistic_quorum_min": 0,
 	"max_events": 200,
+	"stat_polling": true,
+	"distributed_polling": false,

Review comment:
       Is this flag then necessary for all TMs? Or only for those that wish to participate in distributed polling? Can you have some with it `false` and some with it `true` or will all TMs then have to abide by one or the other methods?

##########
File path: traffic_monitor/manager/monitorconfig.go
##########
@@ -286,25 +307,46 @@ func monitorConfigListen(
 		}
 
 		peerSet := map[tc.TrafficMonitorName]struct{}{}
+		tmsByGroup := make(map[string][]tc.TrafficMonitor)
 		for _, srv := range monitorConfig.TrafficMonitor {
-			if srv.HostName == staticAppData.Hostname {
+			if tc.CacheStatusFromString(srv.ServerStatus) != tc.CacheStatusOnline {
+				continue
+			}
+			tmsByGroup[srv.Location] = append(tmsByGroup[srv.Location], srv)
+		}
+
+		for _, srv := range monitorConfig.TrafficMonitor {
+			if srv.HostName == staticAppData.Hostname || (cfg.DistributedPolling && srv.Location != thisTMGroup) {
 				continue
 			}
 			if tc.CacheStatusFromString(srv.ServerStatus) != tc.CacheStatusOnline {
 				continue
 			}
 			// TODO: the URL should be config driven. -jse
-			url4 := fmt.Sprintf("http://%s:%d/publish/CrStates?raw", srv.IP, srv.Port)
-			url6 := fmt.Sprintf("http://[%s]:%d/publish/CrStates?raw", ipv6CIDRStrToAddr(srv.IP6), srv.Port)
-			peerURLs[srv.HostName] = poller.PollConfig{URL: url4, URLv6: url6, Host: srv.FQDN} // TODO determine timeout.
+			peerURL := fmt.Sprintf("http://%s:%d/publish/CrStates?raw", srv.FQDN, srv.Port)

Review comment:
       Do we no longer need IPv6?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804125720



##########
File path: traffic_monitor/peer/crstates.go
##########
@@ -36,7 +36,7 @@ type CRStatesThreadsafe struct {
 
 // NewCRStatesThreadsafe creates a new CRStatesThreadsafe object safe for multiple goroutine readers and a single writer.
 func NewCRStatesThreadsafe() CRStatesThreadsafe {
-	crs := tc.NewCRStates()
+	crs := tc.NewCRStates(8, 8)

Review comment:
       Yes, this is the default map capacity when no capacity is given. I'll label them as such.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804119961



##########
File path: traffic_monitor/datareq/crstate.go
##########
@@ -51,15 +58,36 @@ func srvTRState(params url.Values, localStates peer.CRStatesThreadsafe, combined
 		}
 	}
 
-	data, err := srvTRStateDerived(combinedStates, peerStates)
+	data, err := srvTRStateDerived(combinedStates, local && distributedPollingEnabled)
 
 	return data, http.StatusOK, err
 }
 
-func srvTRStateDerived(combinedStates peer.CRStatesThreadsafe, peerStates peer.CRStatesPeersThreadsafe) ([]byte, error) {
-	return tc.CRStatesMarshall(combinedStates.Get())
+func srvTRStateDerived(combinedStates peer.CRStatesThreadsafe, directlyPolledOnly bool) ([]byte, error) {
+	if !directlyPolledOnly {

Review comment:
       On 2nd thought, `combinedStates` definitely contains all the caches, not just the directly-polled. So we do need the filtering.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804112914



##########
File path: traffic_monitor/config/config.go
##########
@@ -237,6 +235,9 @@ func (c *Config) UnmarshalJSON(data []byte) error {
 	if aux.TrafficOpsMaxRetryIntervalMs != nil {
 		c.TrafficOpsMaxRetryInterval = time.Duration(*aux.TrafficOpsMaxRetryIntervalMs) * time.Millisecond
 	}
+	if c.StatPolling && c.DistributedPolling {
+		return errors.New("invalid configuration: stat_polling cannot be enabled if distributed_polling is also enabled")

Review comment:
       Not really, that is a perfectly valid state. TMs don't have to be distributed in order to disable health polling. If, for instance, someone's legacy TMs were overloaded, they could disable stat polling without going distributed in order to reduce load.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] mattjackson220 merged pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
mattjackson220 merged pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804081842



##########
File path: traffic_monitor/cache/data.go
##########
@@ -25,11 +25,6 @@ import (
 	"github.com/apache/trafficcontrol/lib/go-tc"
 )
 
-// AvailableStatusReported is the status string returned by caches set to

Review comment:
       It already existed at `tc.CacheStatusReported`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804123740



##########
File path: traffic_monitor/manager/monitorconfig.go
##########
@@ -337,6 +379,103 @@ func monitorConfigListen(
 	}
 }
 
+// getCacheGroupsToPoll returns the name of this Traffic Monitor's cache group
+// and the set of cache groups it needs to poll.
+func getCacheGroupsToPoll(distributedPolling bool, hostname string, monitors map[string]tc.TrafficMonitor,
+	caches map[string]tc.TrafficServer, allCacheGroups map[string]tc.TMCacheGroup) (string, map[string]tc.TMCacheGroup, error) {
+	tmGroupSet := make(map[string]tc.TMCacheGroup)
+	cacheGroupSet := make(map[string]tc.TMCacheGroup)
+	tmGroupToPolledCacheGroups := make(map[string]map[string]tc.TMCacheGroup)
+	thisTMGroup := ""
+
+	for _, tm := range monitors {
+		if tm.HostName == hostname {
+			thisTMGroup = tm.Location
+		}
+		if tc.CacheStatusFromString(tm.ServerStatus) == tc.CacheStatusOnline {
+			tmGroupSet[tm.Location] = allCacheGroups[tm.Location]
+			if _, ok := tmGroupToPolledCacheGroups[tm.Location]; !ok {
+				tmGroupToPolledCacheGroups[tm.Location] = make(map[string]tc.TMCacheGroup)
+			}
+		}
+	}
+	if thisTMGroup == "" {
+		return "", nil, fmt.Errorf("unable to find cache group for this Traffic Monitor (%s) in monitoring config snapshot", hostname)
+	}
+
+	for _, c := range caches {
+		status := tc.CacheStatusFromString(c.ServerStatus)
+		if status == tc.CacheStatusOnline || status == tc.CacheStatusReported || status == tc.CacheStatusAdminDown {
+			cacheGroupSet[c.CacheGroup] = allCacheGroups[c.CacheGroup]
+		}
+	}
+
+	if !distributedPolling {
+		return thisTMGroup, cacheGroupSet, nil
+	}
+
+	tmGroups := make([]string, 0, len(tmGroupSet))
+	for tg := range tmGroupSet {
+		tmGroups = append(tmGroups, tg)
+	}
+	sort.Strings(tmGroups)
+	cgs := make([]string, 0, len(cacheGroupSet))
+	for cg := range cacheGroupSet {
+		cgs = append(cgs, cg)
+	}
+	sort.Strings(cgs)

Review comment:
       Yes, for determinism. Map iteration isn't sorted, so we need to sort the keys and iterate over them instead. Otherwise, order wouldn't be guaranteed, and TMs wouldn't all agree on which cachegroups they're supposed to poll.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804124720



##########
File path: traffic_monitor/manager/monitorconfig.go
##########
@@ -337,6 +379,103 @@ func monitorConfigListen(
 	}
 }
 
+// getCacheGroupsToPoll returns the name of this Traffic Monitor's cache group
+// and the set of cache groups it needs to poll.
+func getCacheGroupsToPoll(distributedPolling bool, hostname string, monitors map[string]tc.TrafficMonitor,
+	caches map[string]tc.TrafficServer, allCacheGroups map[string]tc.TMCacheGroup) (string, map[string]tc.TMCacheGroup, error) {
+	tmGroupSet := make(map[string]tc.TMCacheGroup)
+	cacheGroupSet := make(map[string]tc.TMCacheGroup)
+	tmGroupToPolledCacheGroups := make(map[string]map[string]tc.TMCacheGroup)
+	thisTMGroup := ""
+
+	for _, tm := range monitors {
+		if tm.HostName == hostname {
+			thisTMGroup = tm.Location
+		}
+		if tc.CacheStatusFromString(tm.ServerStatus) == tc.CacheStatusOnline {
+			tmGroupSet[tm.Location] = allCacheGroups[tm.Location]
+			if _, ok := tmGroupToPolledCacheGroups[tm.Location]; !ok {
+				tmGroupToPolledCacheGroups[tm.Location] = make(map[string]tc.TMCacheGroup)
+			}
+		}
+	}
+	if thisTMGroup == "" {
+		return "", nil, fmt.Errorf("unable to find cache group for this Traffic Monitor (%s) in monitoring config snapshot", hostname)
+	}
+
+	for _, c := range caches {
+		status := tc.CacheStatusFromString(c.ServerStatus)
+		if status == tc.CacheStatusOnline || status == tc.CacheStatusReported || status == tc.CacheStatusAdminDown {
+			cacheGroupSet[c.CacheGroup] = allCacheGroups[c.CacheGroup]
+		}
+	}
+
+	if !distributedPolling {
+		return thisTMGroup, cacheGroupSet, nil
+	}
+
+	tmGroups := make([]string, 0, len(tmGroupSet))
+	for tg := range tmGroupSet {
+		tmGroups = append(tmGroups, tg)
+	}
+	sort.Strings(tmGroups)
+	cgs := make([]string, 0, len(cacheGroupSet))
+	for cg := range cacheGroupSet {
+		cgs = append(cgs, cg)
+	}
+	sort.Strings(cgs)
+	tmGroupCount := len(tmGroups)
+	var closest string
+	for tmi := 0; len(cgs) > 0; tmi = (tmi + 1) % tmGroupCount {
+		tmGroup := tmGroups[tmi]
+		closest, cgs = findAndRemoveClosestCachegroup(cgs, allCacheGroups[tmGroup], allCacheGroups)
+		tmGroupToPolledCacheGroups[tmGroup][closest] = allCacheGroups[closest]
+	}
+	return thisTMGroup, tmGroupToPolledCacheGroups[thisTMGroup], nil
+}
+
+func findAndRemoveClosestCachegroup(remainingCacheGroups []string, target tc.TMCacheGroup, allCacheGroups map[string]tc.TMCacheGroup) (string, []string) {
+	shortestDistance := math.MaxFloat64
+	shortestIndex := -1
+	for i := 0; i < len(remainingCacheGroups); i++ {
+		distance := getDistance(target, allCacheGroups[remainingCacheGroups[i]])
+		if distance < shortestDistance {
+			shortestDistance = distance
+			shortestIndex = i
+		}
+	}
+	closest := remainingCacheGroups[shortestIndex]
+	remainingCacheGroups = append(remainingCacheGroups[:shortestIndex], remainingCacheGroups[shortestIndex+1:]...)
+	return closest, remainingCacheGroups
+}
+
+const meanEarthRadius = 6371.0
+const x = math.Pi / 180
+
+// toRadians converts degrees to radians.
+func toRadians(d float64) float64 {
+	return d * x
+}
+
+// getDistance gets the great circle distance in kilometers between x and y.
+func getDistance(x, y tc.TMCacheGroup) float64 {

Review comment:
       Maybe? I basically just transliterated this from TR, which I assume is correct because we've been using it for years.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804083303



##########
File path: traffic_monitor/conf/traffic_monitor.cfg
##########
@@ -4,6 +4,8 @@
 	"peer_optimistic": true,
 	"peer_optimistic_quorum_min": 0,
 	"max_events": 200,
+	"stat_polling": true,
+	"distributed_polling": false,

Review comment:
       Generally, all the `ONLINE` TMs should either be all distributed_polling=true or all distributed_polling=false




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804118620



##########
File path: traffic_monitor/datareq/crstate.go
##########
@@ -51,15 +58,36 @@ func srvTRState(params url.Values, localStates peer.CRStatesThreadsafe, combined
 		}
 	}
 
-	data, err := srvTRStateDerived(combinedStates, peerStates)
+	data, err := srvTRStateDerived(combinedStates, local && distributedPollingEnabled)
 
 	return data, http.StatusOK, err
 }
 
-func srvTRStateDerived(combinedStates peer.CRStatesThreadsafe, peerStates peer.CRStatesPeersThreadsafe) ([]byte, error) {
-	return tc.CRStatesMarshall(combinedStates.Get())
+func srvTRStateDerived(combinedStates peer.CRStatesThreadsafe, directlyPolledOnly bool) ([]byte, error) {
+	if !directlyPolledOnly {

Review comment:
       The idea here is that if the `?local` query param is included in the request, the client (a distributed peer) is only interested in the caches that this TM group polls directly. Otherwise, the client is most likely TR or the tc-health-client and is interested in *all* the caches. That is why we do filtering for some request types but not others.
   
   However, now that I think of it, the `combinedStates` might already only be the caches that are directly polled. I'll have to double-check that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804120813



##########
File path: traffic_monitor/manager/monitorconfig.go
##########
@@ -286,25 +307,46 @@ func monitorConfigListen(
 		}
 
 		peerSet := map[tc.TrafficMonitorName]struct{}{}
+		tmsByGroup := make(map[string][]tc.TrafficMonitor)
 		for _, srv := range monitorConfig.TrafficMonitor {
-			if srv.HostName == staticAppData.Hostname {
+			if tc.CacheStatusFromString(srv.ServerStatus) != tc.CacheStatusOnline {
+				continue
+			}
+			tmsByGroup[srv.Location] = append(tmsByGroup[srv.Location], srv)
+		}
+
+		for _, srv := range monitorConfig.TrafficMonitor {
+			if srv.HostName == staticAppData.Hostname || (cfg.DistributedPolling && srv.Location != thisTMGroup) {
 				continue
 			}
 			if tc.CacheStatusFromString(srv.ServerStatus) != tc.CacheStatusOnline {
 				continue
 			}
 			// TODO: the URL should be config driven. -jse
-			url4 := fmt.Sprintf("http://%s:%d/publish/CrStates?raw", srv.IP, srv.Port)
-			url6 := fmt.Sprintf("http://[%s]:%d/publish/CrStates?raw", ipv6CIDRStrToAddr(srv.IP6), srv.Port)
-			peerURLs[srv.HostName] = poller.PollConfig{URL: url4, URLv6: url6, Host: srv.FQDN} // TODO determine timeout.
+			peerURL := fmt.Sprintf("http://%s:%d/publish/CrStates?raw", srv.FQDN, srv.Port)

Review comment:
       Correct, TM peers will no longer oscillate between ipv4 and ipv6, because there is no need. They now just poll each other using their FQDNs (which may resolve to ipv4 or ipv6).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [trafficcontrol] rawlinp commented on a change in pull request #6527: Traffic Monitor Distributed Polling

Posted by GitBox <gi...@apache.org>.
rawlinp commented on a change in pull request #6527:
URL: https://github.com/apache/trafficcontrol/pull/6527#discussion_r804080620



##########
File path: lib/go-tc/traffic_monitor.go
##########
@@ -438,7 +438,7 @@ type TrafficMonitor struct {
 	Profile string `json:"profile"`
 	// Location is the Name of the Cache Group to which the Traffic Monitor
 	// belongs - called "Location" for legacy reasons.
-	Location string `json:"location"`
+	Location string `json:"cachegroup"`

Review comment:
       This wasn't really clean-up, it was more of a bug. The TM snapshot uses `cachegroup`, but TM was parsing it into this struct using `location`, so it wasn't actually parsing it at all. I think the intention was that all fields in the struct should have GoDoc comments.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@trafficcontrol.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org