You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2021/10/20 20:40:08 UTC

[jira] [Updated] (HDDS-3600) ManagedChannels leaked on ratis pipeline when there are many connection retries

     [ https://issues.apache.org/jira/browse/HDDS-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Rose updated HDDS-3600:
-----------------------------
    Target Version/s: 1.3.0  (was: 1.2.0)

I am managing the 1.2.0 release and we currently have more than 600 issues targeted for 1.2.0. I am moving the target field to 1.3.0.

If you are actively working on this jira and believe this should be targeted for the 1.2.0 release, Please reach out to me via Apache email or Slack.

> ManagedChannels leaked on ratis pipeline when there are many connection retries
> -------------------------------------------------------------------------------
>
>                 Key: HDDS-3600
>                 URL: https://issues.apache.org/jira/browse/HDDS-3600
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Client
>    Affects Versions: 1.0.0
>            Reporter: Rakesh Radhakrishnan
>            Priority: Critical
>              Labels: TriagePending
>         Attachments: HeapHistogram-Snapshot-ManagedChannel-Leaked-001.png, outloggenerator-ozonefs-003.log
>
>
> ManagedChannels leaked on ratis pipeline when there are many connection retries
> Observed that too many ManagedChannels opened while running Synthetic Hadoop load generator.
>  Ran benchmark with only one pipeline in the cluster and also ran with only two pipelines in the cluster. 
>  Both the run failed with too many open files and could see many open TCP connections for long time and suspecting channel leaks..
> More details below:
>  *1)* Execute NNloadGenerator
> {code:java}
> [rakeshr@ve1320 loadOutput]$ ps -ef | grep load
> hdfs     362822      1 19 05:24 pts/0    00:03:16 /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_jar -Xmx825955249 -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dyarn.log.dir=/var/log/hadoop-yarn -Dyarn.log.file=hadoop.log -Dyarn.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/libexec/../../hadoop-yarn -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/lib/native -Dhadoop.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.2.0.0-141-tests.jar NNloadGenerator -root o3fs://bucket2.vol2/
> rakeshr  368739 354174  0 05:41 pts/0    00:00:00 grep --color=auto load
> {code}
> *2)* Active 9858 TCP connections during the run, which is ratis pipeline default port.
> {code:java}
> [rakeshr@ve1320 loadOutput]$ sudo lsof -a -p 362822 | grep "9858" | wc
>    3229   32290  494080
> [rakeshr@ve1320 loadOutput]$ vi tcp_log
> ............
> java    440633 hdfs 4090u     IPv4          271141987       0t0        TCP ve1320.halxg.cloudera.com:35190->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4091u     IPv4          271127918       0t0        TCP ve1320.halxg.cloudera.com:35192->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4092u     IPv4          271038583       0t0        TCP ve1320.halxg.cloudera.com:59116->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4093u     IPv4          271038584       0t0        TCP ve1320.halxg.cloudera.com:59118->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java    440633 hdfs 4095u     IPv4          271127920       0t0        TCP ve1320.halxg.cloudera.com:35196->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> [rakeshr@ve1320 loadOutput]$ ^C
>  {code}
> *3)* heapdump shows there are 9571 ManagedChanel objects. Heapdump is quite large and attached snapshot to this jira.
> *4)* Attached output and threadump of the SyntheticLoadGenerator benchmark client process to show the exceptions printed to the console. FYI, this file was quite large and have trimmed few repeated exception traces..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org