
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/01/07 21:25:24 UTC

[GitHub] [incubator-tvm] tqchen opened a new issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py::test_topi_depthwise_conv2d_backward_weight_nhwc

tqchen opened a new issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py::test_topi_depthwise_conv2d_backward_weight_nhwc
URL: https://github.com/apache/incubator-tvm/issues/4646
 
 
   NOTE: there have been quite a few CI failures on master since #4511 was merged (which may or may not be tied to that particular PR). These errors are quite consistent on the CI instance aws.g4.n0.cuda0 but do not happen on other instances.
   
   - https://ci.tvm.ai/blue/rest/organizations/jenkins/pipelines/tvm/branches/master/runs/257/nodes/244/log/?start=0
   - https://ci.tvm.ai/blue/rest/organizations/jenkins/pipelines/tvm/branches/master/runs/256/nodes/244/log/?start=0
   
   This is an issue to track the debugging of the cause of the problem.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-tvm] zhiics commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py

Posted by GitBox <gi...@apache.org>.

zhiics commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py
URL: https://github.com/apache/incubator-tvm/issues/4646#issuecomment-571933507
 
 
   I tested it locally and found that the problem is caused by the tests with large input sizes introduced in #4511. With these tests:
   
   https://github.com/apache/incubator-tvm/blob/bc0274d307226408c69226cf922dd916d773e265/topi/tests/python/test_topi_conv2d_NCHWc.py#L232
   
   https://github.com/apache/incubator-tvm/blob/bc0274d307226408c69226cf922dd916d773e265/topi/tests/python/test_topi_conv2d_int8.py#L200
   
   https://github.com/apache/incubator-tvm/blob/bc0274d307226408c69226cf922dd916d773e265/topi/tests/python/test_topi_conv2d_nchw.py#L199
   
   the G4 instance runs out of memory, which may not be the case on other instances. This probably also explains the flakiness, because some of the CI GPU instances are P2 instances (which have at least 60G RAM: https://aws.amazon.com/ec2/instance-types/p2/).
   
   <img width="1440" alt="Screen Shot 2020-01-07 at 9 20 03 PM" src="https://user-images.githubusercontent.com/5145158/71960250-614ea700-31a9-11ea-92cf-f9d48eb30085.png">
   
   This can be reproduced by checking out the docker image and running `docker/bash.sh tvmai/ci-gpu:v0.56 tests/scripts/task_python_topi.sh` with the same config.cmake used in the CI.
   
   I tried reducing the size of these tests, and the failures went away; the run now uses around 8G RAM. Hopefully this solves the problem.
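
   To illustrate why input size matters here, the following is a rough sketch (not code from the TVM test suite) that estimates the bytes held by the float32 input, weight, and output tensors of a single NCHW conv2d. The helper name and both workloads are hypothetical and chosen only for illustration; the real tests also allocate reference results and intermediate buffers, so actual usage would be higher than this estimate.

```python
from math import prod

BYTES_PER_FLOAT32 = 4


def conv2d_nchw_footprint_bytes(batch, in_channel, in_size, num_filter, kernel,
                                stride=1, padding=0):
    """Bytes needed for the float32 input, weight, and output of one conv2d."""
    # Standard output-size formula for a square convolution without dilation.
    out_size = (in_size + 2 * padding - kernel) // stride + 1
    shapes = [
        (batch, in_channel, in_size, in_size),     # input
        (num_filter, in_channel, kernel, kernel),  # weight
        (batch, num_filter, out_size, out_size),   # output
    ]
    return sum(prod(shape) for shape in shapes) * BYTES_PER_FLOAT32


# Hypothetical workloads, NOT taken from the TVM test files.
small = conv2d_nchw_footprint_bytes(1, 64, 56, 64, 3, padding=1)
large = conv2d_nchw_footprint_bytes(1, 2048, 512, 2048, 3, padding=1)
print(f"small: {small / 2**20:.2f} MiB, large: {large / 2**30:.2f} GiB")
```

   Scaling the spatial size and channel count together grows the footprint multiplicatively, which is how a handful of large cases can dominate the memory use of a whole test run.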
   


[GitHub] [incubator-tvm] zhiics commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py

Posted by GitBox <gi...@apache.org>.

zhiics commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py
URL: https://github.com/apache/incubator-tvm/issues/4646#issuecomment-571781668
 
 
   I have an initial guess about why. I will double-check tonight once I get some cycles and summarize the findings if it turns out to be correct.


[GitHub] [incubator-tvm] zhiics closed issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py

Posted by GitBox <gi...@apache.org>.

zhiics closed issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py
URL: https://github.com/apache/incubator-tvm/issues/4646
 
 
   


[GitHub] [incubator-tvm] zhiics edited a comment on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py

Posted by GitBox <gi...@apache.org>.

zhiics edited a comment on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py
URL: https://github.com/apache/incubator-tvm/issues/4646#issuecomment-571933507
 
 
   I tested it locally and found that the problem is caused by the tests with large input sizes introduced in #4511. With these tests:
   
   https://github.com/apache/incubator-tvm/blob/bc0274d307226408c69226cf922dd916d773e265/topi/tests/python/test_topi_conv2d_NCHWc.py#L232
   
   https://github.com/apache/incubator-tvm/blob/bc0274d307226408c69226cf922dd916d773e265/topi/tests/python/test_topi_conv2d_int8.py#L200
   
   https://github.com/apache/incubator-tvm/blob/bc0274d307226408c69226cf922dd916d773e265/topi/tests/python/test_topi_conv2d_nchw.py#L199
   
   the G4 xlarge instance runs out of memory, which may not be the case on other instances. This probably also explains the flakiness, because some of the CI GPU instances are P2 instances (which have at least 60G RAM: https://aws.amazon.com/ec2/instance-types/p2/).
   
   <img width="1440" alt="Screen Shot 2020-01-07 at 9 20 03 PM" src="https://user-images.githubusercontent.com/5145158/71960250-614ea700-31a9-11ea-92cf-f9d48eb30085.png">
   
   This can be reproduced by checking out the docker image and running `docker/bash.sh tvmai/ci-gpu:v0.56 tests/scripts/task_python_topi.sh` with the same config.cmake used in the CI.
   
   I tried reducing the size of these tests (we probably don't want such large inputs for unit tests?) and the failures went away. The run now uses around 8G RAM (it was around 7G before this PR, if I remember correctly). Hopefully this solves the problem.
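
   One way to get RAM figures like the ones above is to record the peak resident set size of the test process. The sketch below is an assumption about how that could be done, not the method used in this thread: it relies on Python's standard `resource` module (Unix only), and the helper name `peak_child_rss_gib` is hypothetical. It would need to run inside the container, since measuring the `docker` client from the host would not reflect the tests' memory.

```python
import resource
import subprocess
import sys


def peak_child_rss_gib(cmd):
    """Run cmd to completion and return (exit code, peak child RSS in GiB)."""
    completed = subprocess.run(cmd, check=False)
    # RUSAGE_CHILDREN covers children that have terminated and been waited for;
    # ru_maxrss is the largest RSS among them, not a sum across processes.
    # On Linux the value is in kilobytes (macOS reports bytes instead).
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak_kib / 2**20


# Trivial child process so the sketch runs anywhere; swap in the real command.
code, peak = peak_child_rss_gib([sys.executable, "-c", "pass"])
print(f"exit code {code}, peak RSS {peak:.3f} GiB")
```

   For the case in this thread, the command could be something like `[sys.executable, "-m", "pytest", "topi/tests/python/test_topi_conv2d_nchw.py"]`, run once with the large cases and once without.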
   


[GitHub] [incubator-tvm] optima2005 commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py

Posted by GitBox <gi...@apache.org>.

optima2005 commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py
URL: https://github.com/apache/incubator-tvm/issues/4646#issuecomment-571946335
 
 
   I misread the data size and output filter columns in those additional test cases. Sorry about that. I will re-check them.



[GitHub] [incubator-tvm] zhiics commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py

Posted by GitBox <gi...@apache.org>.

zhiics commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py
URL: https://github.com/apache/incubator-tvm/issues/4646#issuecomment-572275446
 
 
   Closed by #4653


[GitHub] [incubator-tvm] tqchen commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py::test_topi_depthwise_conv2d_backward_weight_nhwc

Posted by GitBox <gi...@apache.org>.

tqchen commented on issue #4646: [TEST][FLAKY] topi/tests/python/test_topi_depthwise_conv2d_back_weight.py::test_topi_depthwise_conv2d_backward_weight_nhwc
URL: https://github.com/apache/incubator-tvm/issues/4646#issuecomment-571780917
 
 
   @zhiics said he is able to confirm a repro on the particular instance. Here are a few next steps:
   
   - run a bisection to find exactly which commit introduced the problem
   - try to get a backtrace from gdb to see what is going on
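
   The bisection step could be automated with `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad. Below is a minimal sketch of a helper script for that, under stated assumptions: the script and its function names are hypothetical, the test is run through pytest, and rebuilding TVM at each bisection step (which a real run would need) is left out.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_exit_code(pytest_returncode):
    """Translate a pytest return code into a `git bisect run` verdict."""
    if pytest_returncode == 0:
        return GOOD
    if pytest_returncode == 1 or pytest_returncode < 0:
        # 1 means tests failed; a negative value means the process was
        # killed by a signal (for example a segfault or the OOM killer).
        return BAD
    # Interrupted, internal error, usage error, or no tests collected:
    # this commit cannot be judged, so ask git to skip it.
    return SKIP


def main(test_path):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-x", test_path], check=False
    )
    return bisect_exit_code(result.returncode)


# Only runs when invoked as a script with the test path as an argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

   Saved under a hypothetical name such as `bisect_check.py`, it could be used as `git bisect start <bad> <good>` followed by `git bisect run python bisect_check.py topi/tests/python/test_topi_depthwise_conv2d_back_weight.py`.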
   
