Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/01/11 20:17:50 UTC

[GitHub] ctcyang opened a new pull request #13852: Fix Tree Reduction on new instance type p3dn.24xlarge

URL: https://github.com/apache/incubator-mxnet/pull/13852
 
 
   ## Description ##
   Solves the issue raised here: https://github.com/dmlc/gluon-nlp/issues/520
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [ ] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) created (except PRs with tiny changes)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
   - [ ] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments are documented. 
   - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
   - Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
   - [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] Adds a fallback for topology detection: when the link information reported by `cudaDeviceGetP2PAttribute` is inconsistent across instance types, use the P2P matrix obtained from `cudaDeviceEnablePeerAccess` instead.
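
The consistency check behind this fallback can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AllP2PDetected` and `LinkMatrix` are hypothetical names, the P2P matrix is passed as rows of `'v'`/`'.'` characters to mirror the tables in the comments below, and the sketch assumes that a value greater than 1 in the `cudaDeviceGetP2PAttribute` matrix marks a direct P2P link.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Link matrix as reported via cudaDeviceGetP2PAttribute.
// Assumption for this sketch: 0 = same device, 1 = no direct link,
// 2 or 3 = direct P2P link (3 being double NVLink).
using LinkMatrix = std::vector<std::vector<int>>;

// Hypothetical helper (not taken from the PR). Returns true when every GPU
// pair for which peer access could be enabled ('v' in p2p) is also reported
// as a direct link (value > 1) in the link matrix. A false result means the
// link matrix is missing connections and the caller should fall back to
// the P2P matrix.
bool AllP2PDetected(const LinkMatrix& link,
                    const std::vector<std::string>& p2p) {
  assert(link.size() == p2p.size());
  for (std::size_t i = 0; i < link.size(); ++i) {
    assert(link[i].size() == p2p[i].size());
    for (std::size_t j = 0; j < link[i].size(); ++j) {
      if (i != j && p2p[i][j] == 'v' && link[i][j] <= 1) {
        return false;  // peer access works, but no link was reported
      }
    }
  }
  return true;
}
```

With the matrices listed in the comments below, this check passes for p3.16xlarge and fails for p3dn.24xlarge, because pairs such as GPU 0 and GPU 3 have peer access enabled but are reported with a value of 1.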
   
   ## Comments ##
   - On p3dn.24xlarge with CUDA 9.0, the detected connection topology differs from what we observe on p3.16xlarge with the same CUDA version:
   ```
     // Check that all P2P connections are detected by GetP2PAttribute
     // If yes, then continue as before
     // If not, then fall back to using p2p_matrix (from EnableP2P() in src/kvstore/comm_tree.h)
     //
     // We have observed that with CUDA 9.0 p3.16xlarge:
     //
     //   0 2 2 3 3 1 1 1                . v v v v . . .
     //   2 0 3 2 1 3 1 1                v . v v . v . .
     //   2 3 0 3 1 1 2 1                v v . v . . v .
     //   3 2 3 0 1 1 1 2                v v v . . . . v
     //   3 1 1 1 0 2 2 3                v . . . . v v v
     //   1 3 1 1 2 0 3 2                . v . . v . v v
     //   1 1 2 1 2 3 0 3                . . v . v v . v
     //   1 1 1 2 3 2 3 0                . . . v v v v .
     //
     //        matrix                       p2p_matrix
     // cudaDeviceGetP2PAttribute   cudaDeviceEnablePeerAccess
     //
     // Here, they are correctly detected, because the 2s and 3s correspond to
     // links that have P2P connections between them. However for CUDA 9.0 p3dn.24xlarge:
     //
     //   0 2 2 1 1 1 1 1                . v v v v . . .
     //   2 0 1 2 1 1 1 1                v . v v . v . .
     //   2 1 0 1 1 1 2 1                v v . v . . v .
     //   1 2 1 0 1 1 1 2                v v v . . . . v
     //   1 1 1 1 0 2 2 1                v . . . . v v v
     //   1 1 1 1 2 0 1 2                . v . . v . v v
     //   1 1 2 1 2 1 0 1                . . v . v v . v
     //   1 1 1 2 1 2 1 0                . . . v v v v .
     //  
     //        matrix                      p2p_matrix
     // cudaDeviceGetP2PAttribute   cudaDeviceEnablePeerAccess
     //
     // The fastest connections (3 i.e. double NVLink) are not recognized as being a connection
   ```
