You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@knox.apache.org by "Sandor Molnar (Jira)" <ji...@apache.org> on 2021/03/14 23:07:00 UTC

[jira] [Created] (KNOX-2549) Knox CM discovery may lose relevant audit events

Sandor Molnar created KNOX-2549:
-----------------------------------

             Summary: Knox CM discovery may lose relevant audit events
                 Key: KNOX-2549
                 URL: https://issues.apache.org/jira/browse/KNOX-2549
             Project: Apache Knox
          Issue Type: Bug
          Components: cm-discovery
    Affects Versions: 1.5.0
            Reporter: Sandor Molnar
            Assignee: Sandor Molnar
             Fix For: 1.6.0


 *Test steps:*
1. Configure Knox service discovery; Knox is in HA
1. Updated hive on tez-cm config {{hive.server2.transport.mode}} from binary to http
2. Restarted {{Hive on Tez}} service

*Expected:*
After 2 mins, check if the HIVE service have the updated endpoint URL in the relevant topology 

*{color:#de350b}Issue:{color}*
HIVE endpoint URL is updated on only one of the Knox instances.

 
Let's see the relevant gateway logs and the related commands from CM of my new findings:
{noformat}
gateway.log - knox-1-host
-----------------------------
2020-12-03 00:32:23,751 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
2020-12-03 00:32:23,769 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:31:23.681Z
2020-12-03 00:32:23,861 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
2020-12-03 00:32:23,861 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
2020-12-03 00:32:32,042 DEBUG knox.gateway (ClouderaManagerDescriptorMonitor.java:monitorClouderaManagerDescriptor(96)) - Skipping CM descriptor monitoring of /var/lib/knox/gateway/conf/descriptors/cdp-resources.cm (force = false; lastReloadTime = 2020-12-03T08:20:24.486931Z; lastModified = 2020-12-03T08:20:24.486931Z)
...
2020-12-03 00:33:23,862 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
2020-12-03 00:33:23,874 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:32:23.769Z
2020-12-03 00:33:23,938 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(402)) - There is no any activation event found within the given time period


gateway.log - knox-host-2
-----------------------------

2020-12-03 00:32:07,745 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
2020-12-03 00:32:07,920 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:31:07.670Z
2020-12-03 00:32:08,039 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2 = false
2020-12-03 00:32:08,043 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: d3176932-8c50-4c42-879c-7e25ae9e2cbd = false
...
2020-12-03 00:33:08,045 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:run(182)) - Checking Cluster 1 @ https://my-cm-host:7183 for configuration changes...
2020-12-03 00:33:08,083 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:getRelevantEvents(393)) - Querying configuration activation events from Cluster 1 @ https://my-cm-host:7183 since 2020-12-03T08:32:07.922Z
2020-12-03 00:33:08,184 DEBUG discovery.cm (PollingConfigurationAnalyzer.java:isRelevantEvent(422)) - Activation event relevance: 7537adcb-d7da-4591-aa66-1f857584c099 = true


AUDIT events fetched from CM
----------------------------

https://my-cm-host:7183/api/v41/events?query=category==AUDIT_EVENT;timeOccurred=gt=2020-12-03T08:31:23.681Z;timeOccurred=lt=2020-12-03T08:32:23.769Z
{
	totalResults: 3,
	items: [{
			id: "7537adcb-d7da-4591-aa66-1f857584c099",
			content: "Command Start on subject ATLAS-1 succeeded.",
			timeOccurred: "2020-12-03T08:32:20.366Z",
			timeReceived: "2020-12-03T08:32:20.369Z",
			category: "AUDIT_EVENT",
			severity: "INFORMATIONAL",
			alert: false,
			attributes: [{
					values: [
						"SvcCmdArgs{targetRoles=[], args=[]}"
					],
					name: "COMMAND_ARGS"
				},
				{
					values: [
						"Successfully started service."
					],
					name: "COMMAND_RESULT"
				},
				{
					values: [
						"556"
					],
					name: "COMMAND_ID"
				},
				{
					values: [
						"clouderaManager"
					],
					name: "USER"
				},
				{
					values: [
						"Cluster 1"
					],
					name: "CLUSTER_DISPLAY_NAME"
				},
				{
					values: [
						"https://my-cm-host5003/cmf/eventRedirect/7537adcb-d7da-4591-aa66-1f857584c099"
					],
					name: "URL"
				},
				{
					values: [
						"ATLAS-1"
					],
					name: "SERVICE_DISPLAY_NAME"
				},
				{
					values: [
						"SUCCEEDED"
					],
					name: "COMMAND_STATUS"
				},
				{
					values: [
						"EV_SERVICE_STARTED"
					],
					name: "EVENTCODE"
				},
				{
					values: [
						"false"
					],
					name: "ALERT_SUPPRESSED"
				},
				{
					values: [
						"Start"
					],
					name: "COMMAND"
				},
				{
					values: [
						"ATLAS-1"
					],
					name: "SERVICE"
				},
				{
					values: [
						"ATLAS"
					],
					name: "SERVICE_TYPE"
				},
				{
					values: [
						"1"
					],
					name: "CLUSTER_ID"
				},
				{
					values: [
						"Cluster 1"
					],
					name: "CLUSTER"
				},
				{
					values: [
						"COMMAND_SUCCEEDED_WITH_TARGET"
					],
					name: "MESSAGE_CODES"
				}
			]
		},
		{
			id: "e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2",
			content: "User admin created a new revision. Message was: Modified Server HTTPS Port.",
			timeOccurred: "2020-12-03T08:32:01.769Z",
			timeReceived: "2020-12-03T08:32:01.772Z",
			category: "AUDIT_EVENT",
			severity: "INFORMATIONAL",
			alert: false,
			attributes: [{
					values: [
						"257"
					],
					name: "REVISION"
				},
				{
					values: [
						"admin"
					],
					name: "USER"
				},
				{
					values: [
						"Cluster 1"
					],
					name: "CLUSTER_DISPLAY_NAME"
				},
				{
					values: [
						"https://knox-host-2:5003/cmf/eventRedirect/e8cd499f-49ad-4412-9d3a-7a4cbad3e5c2"
					],
					name: "URL"
				},
				{
					values: [
						"ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63"
					],
					name: "ROLE"
				},
				{
					values: [
						"ATLAS-1-ATLAS_SERVER-BASE"
					],
					name: "ROLE_CONFIG_GROUP"
				},
				{
					values: [
						"ATLAS-1"
					],
					name: "SERVICE_DISPLAY_NAME"
				},
				{
					values: [
						"EV_REVISION_CREATED"
					],
					name: "EVENTCODE"
				},
				{
					values: [
						"false"
					],
					name: "ALERT_SUPPRESSED"
				},
				{
					values: [
						"ATLAS-1"
					],
					name: "SERVICE"
				},
				{
					values: [
						"ATLAS"
					],
					name: "SERVICE_TYPE"
				},
				{
					values: [
						"1"
					],
					name: "CLUSTER_ID"
				},
				{
					values: [
						"Cluster 1"
					],
					name: "CLUSTER"
				},
				{
					values: [
						"REVISION_CREATED"
					],
					name: "MESSAGE_CODES"
				}
			]
		},
		{
			id: "d3176932-8c50-4c42-879c-7e25ae9e2cbd",
			content: "User admin started service Start.",
			timeOccurred: "2020-12-03T08:31:58.360Z",
			timeReceived: "2020-12-03T08:31:58.361Z",
			category: "AUDIT_EVENT",
			severity: "INFORMATIONAL",
			alert: false,
			attributes: [{
					values: [
						"SvcCmdArgs{targetRoles=[DbRole{id=65, name=ATLAS-1-ATLAS_SERVER-353d9af9c42d35ad516e1b1ff11f3a63, hostName=knox-host-2}], args=[]}"
					],
					name: "COMMAND_ARGS"
				},
				{
					values: [
						"556"
					],
					name: "COMMAND_ID"
				},
				{
					values: [
						"admin"
					],
					name: "USER"
				},
				{
					values: [
						"Cluster 1"
					],
					name: "CLUSTER_DISPLAY_NAME"
				},
				{
					values: [
						"https://knox-host-2:5003/cmf/eventRedirect/d3176932-8c50-4c42-879c-7e25ae9e2cbd"
					],
					name: "URL"
				},
				{
					values: [
						"ATLAS-1"
					],
					name: "SERVICE_DISPLAY_NAME"
				},
				{
					values: [
						"STARTED"
					],
					name: "COMMAND_STATUS"
				},
				{
					values: [
						"EV_SERVICE_STARTED"
					],
					name: "EVENTCODE"
				},
				{
					values: [
						"false"
					],
					name: "ALERT_SUPPRESSED"
				},
				{
					values: [
						"Start"
					],
					name: "COMMAND"
				},
				{
					values: [
						"ATLAS-1"
					],
					name: "SERVICE"
				},
				{
					values: [
						"ATLAS"
					],
					name: "SERVICE_TYPE"
				},
				{
					values: [
						"1"
					],
					name: "CLUSTER_ID"
				},
				{
					values: [
						"Cluster 1"
					],
					name: "CLUSTER"
				},
				{
					values: [
						"SERVICE_STARTED"
					],
					name: "MESSAGE_CODES"
				}
			]
		}
	]
}
{noformat}
On {{knox-host-2}} everything worked as expected. However, on {{knox-host-1}}, my ATLAS service configuration change got lost:
 - The start command successfully finished on {{2020-12-03T08:32:20.366Z}}
 - Polling configuration analyzer fetched audit events from CM server at {{2020-12-03 08:32:23,769Z}} since {{2020-12-03T08:31:23.681Z}}. However, CM did +_not_+ return the start command that finished 3 seconds ago, even it should have returned it
 - A minute later, the polling configuration analyzer fetched audit events from CM server at {{2020-12-03 08:33:23,874Z}} since {{2020-12-03T08:32:23.769Z}} (the last query timestamp). This query returned nothing which is correct since there was no activation within this time frame

As a result, the command with ID of {{7537adcb-d7da-4591-aa66-1f857584c099}} has been lost on {{knox-host-1}}. It's important to note that the first timestamp, when Knox issued a CM query, was sort of close to the timestamp when the successful start event occurred. There seems to be a lag between the CM server/agent communication which may cause this issue. This means Knox should be smarter in this respect and try to mitigate the chance of missing relevant audit events.

I'm going to implement the following approach:
 - Knox will look back a little bit more time than is does currently: instead of checking the {{[lastQueryTimestamp -> currentTimestamp]}} interval, we'll check {{[lastQueryTimestamp - (2 x pollingInterval) -> currentTimestamp]}}
 - Knox will maintain a cache of all relevant events that it processed to avoid re-processing them



--
This message was sent by Atlassian Jira
(v8.3.4#803005)