You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2022/02/17 10:38:00 UTC

[jira] [Resolved] (HDDS-6314) ConcurrentModificationException getting SCM Container Metrics

     [ https://issues.apache.org/jira/browse/HDDS-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen O'Donnell resolved HDDS-6314.
-------------------------------------
    Fix Version/s: 1.3.0
       Resolution: Fixed

> ConcurrentModificationException getting SCM Container Metrics
> -------------------------------------------------------------
>
>                 Key: HDDS-6314
>                 URL: https://issues.apache.org/jira/browse/HDDS-6314
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Stephen O'Donnell
>            Assignee: Aswin Shakil Balasubramanian
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.3.0
>
>
> Benchmarking the ContainerReports, I noticed the following ConcurrentModificationException in the SCM logs. As I understand it, this will happen anytime the metrics are built and a new container is added at the same time:
> {code:java}
> scm_1       | 2022-02-13 21:32:40,505 [Timer for 'StorageContainerManager' metrics system] ERROR impl.MetricsSourceAdapter: Error getting metrics from source SCMContainerMetrics
> scm_1       | java.util.ConcurrentModificationException
> scm_1       |     at java.base/java.util.TreeMap$KeySpliterator.forEachRemaining(TreeMap.java:2750)
> scm_1       |     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> scm_1       |     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> scm_1       |     at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> scm_1       |     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> scm_1       |     at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> scm_1       |     at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getContainers(ContainerManagerImpl.java:172)
> scm_1       |     at org.apache.hadoop.hdds.scm.server.StorageContainerManager.getContainerStateCount(StorageContainerManager.java:1764)
> scm_1       |     at org.apache.hadoop.hdds.scm.server.SCMContainerMetrics.getMetrics(SCMContainerMetrics.java:68)
> scm_1       |     at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
> scm_1       |     at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.snapshotMetrics(MetricsSystemImpl.java:423)
> scm_1       |     at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.sampleMetrics(MetricsSystemImpl.java:410)
> scm_1       |     at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.onTimerEvent(MetricsSystemImpl.java:385)
> scm_1       |     at org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4.run(MetricsSystemImpl.java:372)
> scm_1       |     at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
> scm_1       |     at java.base/java.util.TimerThread.run(Timer.java:506)
> {code}
> The metrics code calls into ContainerManagerImpl.getContainers:
> {code:java}
>   @Override
>   public List<ContainerInfo> getContainers(final LifeCycleState state) {
>     return containerStateManager.getContainerIDs(state).stream()
>         .map(ContainerID::getProtobuf)
>         .map(containerStateManager::getContainer)
>         .filter(Objects::nonNull).collect(Collectors.toList());
>   }
> {code}
> This is quite an expensive operation for any containers where there are a large number in a state (eg CLOSED containers), as we take a TreeSet stored in ContainerStateManager, map the ContainerID to proto, only for it to be mapped back to a ContainerID again inside containerStateManager::getContainer. Then it checks if the container exists and pulls the results into a new list. There the metrics manager just calls size on it.
> This work does not seem necessary, as ContainerStateMap maintains the structures for a container under a lock and hence if the container is in the state map, it must be in the container map too.
> To avoid this seemingly expensive operation on a large cluster with millions of containers, we should probably provide a `getContainerStateCount(final LifeCycleState state)` API into ContainerManager where it ultimately calls `size` on the relevant map inside ContainerStateMap.
> However the two getContainers(...) methods in ContainerManagerImpl are not threadsafe as they stand, and are used in other places (eg from the CLI to list containers), so they need fixed too.
> Additionally, in SCMClientProtocolServer.listContainer, the code does not seem to use the fact that the "container lists by state" are already sorted and in TreeMaps, so they could perform the paging much more efficiently. I fear we are potentially forming some very large lists twice per call, only to return a small slice of the list each time. I have raised HDDS-6315 for this issue, to keep the issues separate.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org