You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apex.apache.org by ishark <gi...@git.apache.org> on 2016/03/17 09:03:19 UTC

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

GitHub user ishark opened a pull request:

    https://github.com/apache/incubator-apex-core/pull/274

    APEXCORE-393 #resolve Adding Dag context attributes with increased de…

    …fault value for blacklisting of failed nodes
    
    Added resetting of failure count for nodes after blacklist removal interval.
    Cleaned up code to remove concurrent Map for maintaining failed counts

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ishark/incubator-apex-core APEXCORE-393

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-apex-core/pull/274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #274
    
----
commit ab66a1eafd78e75b1dc1608e8ba276f9d1f60db6
Author: ishark <is...@datatorrent.com>
Date:   2016-03-17T00:39:19Z

    APEXCORE-393 #resolve Adding Dag context attributes with increased default value for blacklisting of failed nodes
    Added resetting of failure count for nodes after blacklist removal interval.
    Cleaned up code to remove concurrent Map for maintaining failed counts

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by tweise <gi...@git.apache.org>.
Github user tweise commented on a diff in the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#discussion_r56538000
  
    --- Diff: api/src/main/java/com/datatorrent/api/Context.java ---
    @@ -475,6 +475,21 @@
          * Only supports string codecs that have a constructor with no arguments
          */
         Attribute<Map<Class<?>, Class<? extends StringCodec<?>>>> STRING_CODECS = new Attribute<Map<Class<?>, Class<? extends StringCodec<?>>>>(new Map2String<Class<?>, Class<? extends StringCodec<?>>>(",", "=", new Class2String<Object>(), new Class2String<StringCodec<?>>()));
    +
    +    /**
    +     * The number of consecutive container failures that should lead to
    +     * blacklisting of nodes by application master
    +     */
    +    Attribute<Integer> MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLCKLIST = new Attribute<Integer>(Integer.MAX_VALUE);
    --- End diff --
    
    Should also mention that default value means it is off, which is the current behavior.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-apex-core/pull/274


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by tweise <gi...@git.apache.org>.
Github user tweise commented on a diff in the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#discussion_r56537836
  
    --- Diff: api/src/main/java/com/datatorrent/api/Context.java ---
    @@ -475,6 +475,21 @@
          * Only supports string codecs that have a constructor with no arguments
          */
         Attribute<Map<Class<?>, Class<? extends StringCodec<?>>>> STRING_CODECS = new Attribute<Map<Class<?>, Class<? extends StringCodec<?>>>>(new Map2String<Class<?>, Class<? extends StringCodec<?>>>(",", "=", new Class2String<Object>(), new Class2String<StringCodec<?>>()));
    +
    +    /**
    +     * The number of consecutive container failures that should lead to
    +     * blacklisting of nodes by application master
    +     */
    +    Attribute<Integer> MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLCKLIST = new Attribute<Integer>(Integer.MAX_VALUE);
    --- End diff --
    
    Spelling


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by tweise <gi...@git.apache.org>.
Github user tweise commented on a diff in the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#discussion_r57087518
  
    --- Diff: engine/src/main/java/com/datatorrent/stram/StreamingAppMasterService.java ---
    @@ -781,18 +794,19 @@ private void execute() throws YarnException, IOException
          /* Remove nodes from blacklist after timeout */
           long currentTime = System.currentTimeMillis();
           List<String> blacklistRemovals = new ArrayList<String>();
    -      for (Iterator<Pair<Long, List<String>>> it = blacklistedNodesQueueWithTimeStamp.iterator(); it.hasNext();) {
    -        Pair<Long, List<String>> entry = it.next();
    -        Long timeDiff = currentTime - entry.getFirst();
    -        if (timeDiff > blacklistRemovalTime) {
    -          blacklistRemovals.addAll(entry.getSecond());
    -          it.remove();
    -        } else {
    -          break;
    +      for (String hostname : failedBlackListedNodes) {
    +        Long timeDiff = currentTime - failedContainerNodesMap.get(hostname).blackListAdditionTime;
    +        if (timeDiff >= blacklistRemovalTime) {
    +          blacklistRemovals.add(hostname);
    +          failedContainerNodesMap.remove(hostname);
             }
           }
    +
           if (!blacklistRemovals.isEmpty()) {
             amRmClient.updateBlacklist(null, blacklistRemovals);
    +        LOG.info("Removing nodes {} from blacklist: time elapsed since last blacklisting due to failure is greater than specified timeout", blacklistRemovals.toString());
    +
    --- End diff --
    
    Minor nit, there are a few extra blank lines added...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by tweise <gi...@git.apache.org>.
Github user tweise commented on the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#issuecomment-200081121
  
    @ishark this looks good!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by ishark <gi...@git.apache.org>.
Github user ishark commented on the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#issuecomment-198587733
  
    Addressed review comments.
    @tweise please review.
    
    Have tested following things:
    1. Checked that properties set for Blacklist removal time and max failure count are applied in App Master
    2. Tested that failure count is reset after Blacklist Removal time and that the nodes are removed from blacklist
    3. If time between 2 same node failures is more than Blacklist Removal time, the fail count gets reset to 1.
    4. If a node is already blacklisted due to consecutive failures within specified interval, we do not keep increamenting failure count


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by tweise <gi...@git.apache.org>.
Github user tweise commented on a diff in the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#discussion_r56538171
  
    --- Diff: api/src/main/java/com/datatorrent/api/Context.java ---
    @@ -475,6 +475,21 @@
          * Only supports string codecs that have a constructor with no arguments
          */
         Attribute<Map<Class<?>, Class<? extends StringCodec<?>>>> STRING_CODECS = new Attribute<Map<Class<?>, Class<? extends StringCodec<?>>>>(new Map2String<Class<?>, Class<? extends StringCodec<?>>>(",", "=", new Class2String<Object>(), new Class2String<StringCodec<?>>()));
    +
    +    /**
    +     * The number of consecutive container failures that should lead to
    +     * blacklisting of nodes by application master
    +     */
    +    Attribute<Integer> MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLCKLIST = new Attribute<Integer>(Integer.MAX_VALUE);
    +
    +    /**
    +     * The amount of time to wait before removing failed nodes from blacklist
    +     */
    +    Attribute<Long> BLACKLIST_REMOVAL_TIME_MILLIS = new Attribute<Long>(new Long(60 * 60 * 1000));
    --- End diff --
    
    Include "NODE" in the attribute name?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by ishark <gi...@git.apache.org>.
Github user ishark commented on a diff in the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#discussion_r57090295
  
    --- Diff: engine/src/main/java/com/datatorrent/stram/StreamingAppMasterService.java ---
    @@ -950,7 +977,11 @@ private void execute() throws YarnException, IOException
     
           if (!blacklistAdditions.isEmpty()) {
             amRmClient.updateBlacklist(blacklistAdditions, null);
    -        blacklistedNodesQueueWithTimeStamp.add(new Pair<Long, List<String>>(System.currentTimeMillis(), blacklistAdditions));
    +        long timeStamp = System.currentTimeMillis();
    +        for(String hostname : blacklistAdditions) {
    --- End diff --
    
    Addressed the checkstyle errors. Thus didn't get caught in checkstyle because max allowed violation count was greater. Have reduced it now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-apex-core pull request: APEXCORE-393 #resolve Adding Dag...

Posted by tweise <gi...@git.apache.org>.
Github user tweise commented on a diff in the pull request:

    https://github.com/apache/incubator-apex-core/pull/274#discussion_r57088465
  
    --- Diff: engine/src/main/java/com/datatorrent/stram/StreamingAppMasterService.java ---
    @@ -950,7 +977,11 @@ private void execute() throws YarnException, IOException
     
           if (!blacklistAdditions.isEmpty()) {
             amRmClient.updateBlacklist(blacklistAdditions, null);
    -        blacklistedNodesQueueWithTimeStamp.add(new Pair<Long, List<String>>(System.currentTimeMillis(), blacklistAdditions));
    +        long timeStamp = System.currentTimeMillis();
    +        for(String hostname : blacklistAdditions) {
    --- End diff --
    
    another nit, space after "for" missing, surprised it passes checkstyle?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---