Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/10/21 07:28:00 UTC
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621738#comment-17621738 ]
ASF GitHub Bot commented on HDFS-16550:
---------------------------------------
hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1286567852
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 42m 0s | | trunk passed |
| +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 29s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 37s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 20s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 44s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 39s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 18s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 26s | | the patch passed |
| +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 20s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/artifact/out/blanks-eol.txt) | The patch has 3 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply |
| -0 :warning: | checkstyle | 1m 2s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 201 unchanged - 0 fixed = 203 total (was 201) |
| +1 :green_heart: | mvnsite | 1m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 17s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 2s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 244m 16s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. |
| | | 357m 42s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 01507fad7bdc 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / d18fa4a4b6296268d56c831da39e0d26329cfb0d |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/testReport/ |
| Max. process+thread count | 3308 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Tao Li
> Assignee: Tao Li
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-04-21-09-54-29-751.png, image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered the following situation while upgrading the JournalNodes.
> Cluster Info:
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the JournalNodes. {color:#ff0000}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster ran for a while; edits cache usage kept increasing until memory was used up.
> 3. {color:#ff0000}Active namenode (nn0){color} shut down because of “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.
> 4. nn1 transitioned to the Active state.
> 5. {color:#ff0000}New Active namenode (nn1){color} also shut down because of the same “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}” error.
> 6. {color:#ff0000}The cluster crashed{color}.
>
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> }
> {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the memory available to the process. If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only a warning is logged during JournalNode startup, which users can easily overlook. However, after the cluster has run for some time, this is likely to cause the cluster to crash.
>
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * Runtime.getRuntime().maxMemory(){*}, we should throw an Exception and {color:#ff0000}fail fast{color}, giving users a clear hint to update the related configuration. Alternatively, if cache-size exceeds 50% (or some other threshold) of maxMemory, force cache-size to be 25% of maxMemory.
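
The two options proposed in the description above (fail fast, or clamp the cache size) could look roughly like the following sketch. It is illustrative only and is not the patch in PR #4209: the class `EditCacheSizeCheck` and the methods `requireWithinHeap` and `clampToHeap` are hypothetical names, and the 0.9, 0.5 and 0.25 fractions are simply the figures mentioned in the description.

```java
// Hypothetical helper, not part of Hadoop: sketches the two proposals
// from the issue description using plain long/double arithmetic.
final class EditCacheSizeCheck {
    private EditCacheSizeCheck() {}

    /**
     * Fail-fast variant: reject a cache capacity larger than
     * maxFraction of the maximum heap instead of only logging a warning.
     */
    public static long requireWithinHeap(long capacityBytes, long maxHeapBytes,
                                         double maxFraction) {
        long limit = (long) (maxFraction * maxHeapBytes);
        if (capacityBytes > limit) {
            throw new IllegalArgumentException(String.format(
                "Cache capacity is set at %d bytes but only %d bytes (%.2f of the "
                + "%d byte maximum heap) are allowed. Decrease the cache size or "
                + "increase the heap size.",
                capacityBytes, limit, maxFraction, maxHeapBytes));
        }
        return capacityBytes;
    }

    /**
     * Clamping variant: if the capacity exceeds triggerFraction of the
     * maximum heap, fall back to fallbackFraction of the maximum heap.
     */
    public static long clampToHeap(long capacityBytes, long maxHeapBytes,
                                   double triggerFraction, double fallbackFraction) {
        if (capacityBytes > (long) (triggerFraction * maxHeapBytes)) {
            return (long) (fallbackFraction * maxHeapBytes);
        }
        return capacityBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long configured = 1L << 30; // 1 GiB, as in the reported configuration

        // Clamping never throws; it silently lowers an oversized setting.
        System.out.println("clamped capacity: "
            + clampToHeap(configured, maxHeap, 0.5, 0.25));

        // Fail-fast surfaces the misconfiguration at startup.
        try {
            requireWithinHeap(configured, maxHeap, 0.9);
            System.out.println("configured capacity accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the reported settings (1G cache, 1G heap), the fail-fast variant would stop the JournalNode at startup, while the clamping variant would keep it running with a 256 MiB cache. Which behaviour is preferable is the trade-off under discussion in this issue.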
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org