Posted to yarn-issues@hadoop.apache.org by "Jim Brennan (Jira)" <ji...@apache.org> on 2021/01/06 22:40:00 UTC
[jira] [Commented] (YARN-9833) Race condition when
DirectoryCollection.checkDirs() runs during container launch
[ https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260089#comment-17260089 ]
Jim Brennan commented on YARN-9833:
-----------------------------------
[~ebadger], [~pbacsko], I filed a new Jira so I could put up the alternate solution [YARN-10562].
If we decide not to go with that approach, then I think we should file a follow-up Jira to fix the errorDirs issue and, while we are at it, stop using CopyOnWriteArrayList for these collections, as it is now pretty wasteful.
> Race condition when DirectoryCollection.checkDirs() runs during container launch
> --------------------------------------------------------------------------------
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.2.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a container, we retrieve the local dirs by calling {{dirsHandler.getLocalDirs()}}, which in turn invokes {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>   this.readLock.lock();
>   try {
>     return Collections.unmodifiableList(localDirs);
>   } finally {
>     this.readLock.unlock();
>   }
> }
> {code}
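> So here we're also in a critical section, guarded by the read lock. But {{Collections.unmodifiableList()}} only returns a _view_ of the collection, not a copy, and the caller keeps using that view after the read lock has been released. After we get the view, {{MonitoringTimerTask.run()}} might be scheduled to run and immediately clear {{localDirs}}, which also empties the view the caller is holding.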
> This caused unexpected behaviour in container-executor, which exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view; we must return a copy with {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
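The view-versus-copy difference described in the quoted issue can be reproduced in a few lines. The following is only a minimal, single-threaded sketch and not the actual YARN code: the class {{ViewVersusCopy}} and the methods {{getGoodDirsAsView()}}, {{getGoodDirsAsCopy()}} and {{simulateCheckDirs()}} are hypothetical names chosen for illustration, the locks are left out because the problem only shows up after the read lock has been released, and a plain JDK copy stands in for Guava's {{ImmutableList.copyOf()}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for DirectoryCollection; not the real YARN class.
public class ViewVersusCopy {
    // Plays the role of DirectoryCollection.localDirs.
    private final List<String> localDirs = new ArrayList<>();

    ViewVersusCopy(List<String> dirs) {
        localDirs.addAll(dirs);
    }

    // Mirrors the getGoodDirs() quoted above: a read-only VIEW backed by localDirs.
    List<String> getGoodDirsAsView() {
        return Collections.unmodifiableList(localDirs);
    }

    // Returns a read-only snapshot; later changes to localDirs do not affect it.
    // (Assumption: a JDK copy is used here in place of ImmutableList.copyOf().)
    List<String> getGoodDirsAsCopy() {
        return Collections.unmodifiableList(new ArrayList<>(localDirs));
    }

    // Mimics only the first step of checkDirs(): clearing the collection.
    void simulateCheckDirs() {
        localDirs.clear();
    }

    public static void main(String[] args) {
        ViewVersusCopy dirs = new ViewVersusCopy(Arrays.asList("/data/1", "/data/2"));

        List<String> view = dirs.getGoodDirsAsView();
        List<String> copy = dirs.getGoodDirsAsCopy();

        // The disk checker runs after the caller has already obtained its list.
        dirs.simulateCheckDirs();

        System.out.println("view size after clear: " + view.size()); // prints 0
        System.out.println("copy size after clear: " + copy.size()); // prints 2
    }
}
```

In this sketch the caller holding the view ends up with an empty list once the backing collection is cleared, which matches the empty {{localDirs}} seen by container-executor, while the caller holding the copy still sees both directories.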
--
This message was sent by Atlassian Jira
(v8.3.4#803005)