You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/11 01:33:02 UTC

[GitHub] [incubator-mxnet] DickJC123 opened a new pull request #18694: Unittest tolerance handling improvements

DickJC123 opened a new pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694


   ## Description ##
   This PR moves to consolidate and standardize the way our data comparison routines in test_utils.py handle tolerances.  Over time, the number of these routines has grown and the approaches have diverged.  For example, two of the most frequently used routines are:
   ```
   def assert_almost_equal(a, b, rtol=None, atol=None, …):
       rtol = get_rtol(rtol)
       atol = get_atol(atol)
       ….
   ```
   and
   ```
   def check_consistency(sym, …,  tol=None, … ):
    if tol is None:
           tol = {np.dtype(np.float16): 1e-1,
                  np.dtype(np.float32): 1e-3,
                  np.dtype(np.float64): 1e-5,
                  …}
   …
   ```
   As show, `assert_almost_equal()` offers separate arguments for specifying relative and absolute tolerances (by default rtol, atol = 1e-5, 1e-20), but these do not vary with dtype.  In contrast, `check_consistency()` only has one tol argument, which it uses for both atol and rtol, but the tol default is dtype-dependent.
   
   This PR unifies these approaches by having both `check_consistency()` and `assert_almost_equal()` support rtol and atol, drawing upon the same set of dtype-dependent defaults.  The goal here is for the testing framework to make sensible tolerance choices for our unittest data comparisons when possible, rather than burden each test developer to do so.  In effect, this PR addresses the current codebase TODO in test_utils.py:
   ```
   def get_rtol(rtol=None):
       """Get default numerical threshold for regression test."""
       # _TODO: get from env variable, different threshold might
       # be needed for different device and dtype
       return 1e-5 if rtol is None else rtol
   ```
   While I believe setting tolerances via env var is not appropriate, test tolerances should be a function of the dtype and device (i.e. context).  A first application for this concept is in support for the newly announced A100 GPU with its TensorFloat-32 (TF32) mode of computation.  The A100 will by default truncate the mantissa of float32 GEMM and Convolution inputs to a float16 precision.  Therefore, although operator i/o’s might be float32, the “effective dtype” for the purposes of tolerance selection is float16.  By consolidating the tolerance handling logic, this PR can now easily incorporate an `effective_dtype(dat)` routine based on context, so that unittests can run 32-bit models and seamlessly apply the appropriate tolerances for each context, be it A100 or non-A100.
   
   This PR was developed on an in-house CI system that runs on many GPU architectures, including the A100.  As part of the PR development, some unittests were modified so that now all tests pass on the A100.  These tests had typically explicitly hard-coded a tolerance appropriate for a float32 test.  The modification for those tests involved dropping the fixed tolerance, allowing this PR’s adaptive context- and dtype-dependent tolerance selection logic to take over.
   
   Thanks to @Kh4L for supplying the GEMM-flag adaptations for TF32 that are part of this PR.
   Thanks also to @drivanov for his prior efforts to adapt `assert_almost_equal()` to work directly on ndarrays, thereby enabling context-dependent default tolerance selection.
   
   @ptrendx @eric-haibin-lin @marcoabreu
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [X ] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) created (except PRs with tiny changes)
   - [ X] Changes are complete (i.e. I finished coding on this PR)
   - [ X] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
   - [ X] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments are documented. 
   - For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
   - Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
   - [X ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [ ] Feature1, tests, (and when applicable, API doc)
   - [ ] Feature2, tests, (and when applicable, API doc)
   
   ## Comments ##
   - If this change is a backward incompatible change, why must this change be made.
   - Interesting edge cases to note here
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on a change in pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

szha commented on a change in pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#discussion_r455478439



##########
File path: Makefile
##########
@@ -431,7 +431,7 @@ endif
 # be JIT-compiled by the updated driver from the included PTX.
 ifeq ($(USE_CUDA), 1)
 ifeq ($(CUDA_ARCH),)
-	KNOWN_CUDA_ARCHS := 30 35 50 52 60 61 70 75
+	KNOWN_CUDA_ARCHS := 30 35 50 52 60 61 70 75 80

Review comment:
       we are removing makefile in #18721 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha merged pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

szha merged pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] DickJC123 edited a comment on pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

DickJC123 edited a comment on pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#issuecomment-660720326


   Thanks @szha!  You probably saw that I've been struggling to get a passing CI- running into and fixing many issues unrelated to my PR along the way.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#issuecomment-660710673


   Thanks for the fixes, @DickJC123. They are really helpful. I did a code review and examined the analysis included in each of them as well as the specific fixes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] szha commented on pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

szha commented on pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#issuecomment-660577521


   @DickJC123 thanks for the change and for fixing the issues you found along the way. Remember to mark the issues this PR resolve in the description.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] DickJC123 commented on pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

DickJC123 commented on pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#issuecomment-660720326


   Thanks @szha!  You probably saw that I've been struggling to get a passing CI- running into and fixing many issues along the way.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] DickJC123 commented on a change in pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

DickJC123 commented on a change in pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#discussion_r456164385



##########
File path: Makefile
##########
@@ -431,7 +431,7 @@ endif
 # be JIT-compiled by the updated driver from the included PTX.
 ifeq ($(USE_CUDA), 1)
 ifeq ($(CUDA_ARCH),)
-	KNOWN_CUDA_ARCHS := 30 35 50 52 60 61 70 75
+	KNOWN_CUDA_ARCHS := 30 35 50 52 60 61 70 75 80

Review comment:
       I will revert the commit that added '80' since the Makefile will soon be removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #18694: Unittest tolerance handling improvements

Posted by GitBox <gi...@apache.org>.

mxnet-bot commented on pull request #18694:
URL: https://github.com/apache/incubator-mxnet/pull/18694#issuecomment-656961124


   Hey @DickJC123 , Thanks for submitting the PR 
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands: 
   - To trigger all jobs: @mxnet-bot run ci [all] 
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2] 
   *** 
   **CI supported jobs**: [unix-cpu, unix-gpu, centos-gpu, miscellaneous, clang, windows-gpu, edge, website, windows-cpu, sanity, centos-cpu]
   *** 
   _Note_: 
    Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin. 
   All CI tests must pass before the PR can be merged. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org