You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Jonathan Eagles (JIRA)" <ji...@apache.org> on 2017/04/10 17:35:41 UTC

[jira] [Commented] (TEZ-3685) ShuffleHandler completedInputSet off-by-one error

    [ https://issues.apache.org/jira/browse/TEZ-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963234#comment-15963234 ] 

Jonathan Eagles commented on TEZ-3685:
--------------------------------------

[~sseth]
- The unordered and ordered case did have an off by one issue. Fixed this up.
- The unordered shuffle manager uses a mix Reentrant locks and synchronization and causes a difficulty in avoiding deadlocks scenarios. Moved totally to using reentrant locks. Re-worked the locking/unlocking to keep correct synchronization.
- The srcAttemptRemaining verify sanity check was found to be redundant, since the inputAttemptIdentifier has already been verified by looking up the identifier in the pathToAttemptMap that is generated from the srcAttemptsRemaining.

> ShuffleHandler completedInputSet off-by-one error
> -------------------------------------------------
>
>                 Key: TEZ-3685
>                 URL: https://issues.apache.org/jira/browse/TEZ-3685
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-3685.1.patch
>
>
> Per comment in TEZ-3334 https://issues.apache.org/jira/browse/TEZ-3334?focusedCommentId=15950550&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15950550
> Addressing The off by 1 issue / other correctness issues in this jira
> - ShuffleManager: has moved to using a BitSet instead of a ConcurrentSet. May need to look at the synchronization points required for the bitset. (access to the map had some synchronized and some non synchronized access)
> - Why was this removed?
> {code}
> -    // Sanity check
> -    // we are guaranteed that key is not null
> -    if (srcAttemptsRemaining.get(srcAttemptId.toString()) == null) {
> -      // wrongMapErrs.increment(1);
> -      LOG.warn("Invalid input. Received output for headerPathComponent: "
> -          + pathComponent + "nextRemainingSrcAttemptId: "
> -          + getNextRemainingAttempt() + ", mappedSrcAttemptId: " + srcAttemptId);
> -      return false;
> -    }
> {code}
> - 
> Is there an off by 1 error here? Specifically. input.getInputIdentifier() + ((CompositeInputAttemptIdentifier) input).getInputIdentifierCount(). Think getInputIdentifierCount is always at least 1.
> {code}
> +      if(input instanceof CompositeInputAttemptIdentifier) {
> +        // LLL Is there an off by 0 case here. i.e. getInputIdentiferCount will be set to 1 for a regular event.
> +        // We start scanning the bitst early. Also depends on how completedInputSet is setup.
> +        if (completedInputSet.nextClearBit(input.getInputIdentifier()) >=
> +            input.getInputIdentifier() + ((CompositeInputAttemptIdentifier) input).getInputIdentifierCount()) {
> +          inputIter.remove();
> +          continue;
> +        }
> +      } else {
> +        if (completedInputSet.get(input.getInputIdentifier())) {
> +          inputIter.remove();
> +          continue;
> +        }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)