You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Eric Badger (Jira)" <ji...@apache.org> on 2021/01/12 21:25:00 UTC

[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833

    [ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263705#comment-17263705 ] 

Eric Badger commented on YARN-10562:
------------------------------------

+1 on the patch. As mentioned above, there is still the race in the code based on the fact that the caller doesn't have to get all Dirs at the same time. But the only issue that this will cause is the dirs being out of date for that iteration. The next time they get a copy, it will be updated. And the list will always be well-constructed. It just has the possibility of being out of sync when compared with the other lists. 

Will wait for the precommit to come back and commit if there are no errors and no objections

> Follow up changes for YARN-9833
> -------------------------------
>
>                 Key: YARN-10562
>                 URL: https://issues.apache.org/jira/browse/YARN-10562
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10562.001.patch, YARN-10562.002.patch, YARN-10562.003.patch, YARN-10562.004.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and related methods were returning an unmodifiable view of the lists. These accesses were protected by read/write locks, but because the lists are CopyOnWriteArrayLists, subsequent changes to the list, even when done under the writelock, were exposed when a caller started iterating the list view. CopyOnWriteArrayLists cache the current underlying list in the iterator, so it is safe to iterate them even while they are being changed - at least the view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them from scratch every time, so if a caller called getGoodDirs() just before checkDirs cleared it, and then started iterating right after the clear, they could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to return a copy of the list, which definitely fixes the race condition. The disadvantage is that now we create a new copy of these lists every time we launch a container. The advantage using CopyOnWriteArrayList was that the lists should rarely ever change, and we can avoid all the copying. Unfortunately, the way checkDirs() was written, it guaranteed that it would modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just rewrites checkDirs() to minimize the changes to the underlying lists. There are still some small windows where a disk will have been added to one list, but not yet removed from another if you hit it just right, but I think these should be pretty rare and relatively harmless, and in the vast majority of cases I suspect only one disk will be moving from one list to another at any time.   The question is whether this type of inconsistency (which was always there before -YARN-9833- is worth reducing all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org