Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/09/25 06:33:13 UTC

[GitHub] [incubator-tvm] jcf94 opened a new pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

jcf94 opened a new pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557


   Bug fix for https://github.com/apache/incubator-tvm/issues/6548.
   
   From the error log:
   ```
   E     tvm._ffi.base.TVMError: Traceback (most recent call last):
   E     [bt] (7) /workspace/build/libtvm.so(TVMFuncCall+0x65) [0x7f2cdb26bfb5]
   E     [bt] (6) /workspace/build/libtvm.so(+0x4e4dcf) [0x7f2cda602dcf]
   E     [bt] (5) /workspace/build/libtvm.so(tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)+0x116) [0x7f2cda6021a6]
   E     [bt] (4) /workspace/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)+0x214) [0x7f2cda698f64]
   E     [bt] (3) /workspace/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::SearchOneRound(int, tvm::runtime::Array<tvm::auto_scheduler::State, void>*)+0x19f) [0x7f2cda6987ff]
   E     [bt] (2) /workspace/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::SampleInitPopulation(tvm::runtime::Array<tvm::auto_scheduler::State, void> const&, int)+0x1fb) [0x7f2cda69395b]
   E     [bt] (1) /workspace/build/libtvm.so(tvm::support::parallel_for(int, int, std::function<void (int)> const&, int, std::function<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > (int, int, int, int)>)+0x11e8) [0x7f2cdac2b9f8]
   E     [bt] (0) /workspace/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x82) [0x7f2cda606ac2]
   E     [bt] (8) /workspace/build/libtvm.so(+0x5756da) [0x7f2cda6936da]
   E     [bt] (7) /workspace/build/libtvm.so(tvm::auto_scheduler::InitChangeComputeLocation::Apply(tvm::auto_scheduler::SketchPolicyNode*, tvm::auto_scheduler::State*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>*) const+0x22d) [0x7f2cda6a1ccd]
   E     [bt] (6) /workspace/build/libtvm.so(tvm::auto_scheduler::ComputeDAG::InferBound(tvm::auto_scheduler::State const&) const+0x253) [0x7f2cda61c783]
   E     [bt] (5) /workspace/build/libtvm.so(tvm::auto_scheduler::ComputeDAG::ApplySteps(tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&, tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, bool) const+0x5e5) [0x7f2cda61c265]
   E     [bt] (4) /workspace/build/libtvm.so(tvm::auto_scheduler::StepApplyToSchedule(tvm::auto_scheduler::Step const&, tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, tvm::te::Schedule*, tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&)+0x220) [0x7f2cda6d7170]
   E     [bt] (3) /workspace/build/libtvm.so(tvm::auto_scheduler::SplitStepNode::ApplyToSchedule(tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*) const+0x39) [0x7f2cda6d0d19]
   E     [bt] (2) /workspace/build/libtvm.so(tvm::auto_scheduler::ApplySplitToSchedule(tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, int, int, tvm::runtime::Array<tvm::runtime::Optional<tvm::Integer>, void> const&, bool)+0xa6) [0x7f2cda6d0576]
   E     [bt] (1) /workspace/build/libtvm.so(tvm::runtime::Array<tvm::tir::IterVar, void>::operator[](long) const+0xb6) [0x7f2cda626616]
   E     [bt] (0) /workspace/build/libtvm.so(+0x4ef2c2) [0x7f2cda60d2c2]
   E     File "/workspace/src/support/parallel_for.cc", line 92
   E   TVMError: Parallel_for error with [09:24:10] /workspace/include/tvm/runtime/container.h:683: Check failed: 0 <= i && i < p->size_: IndexError: indexing 4 on an array of size 4
   ```
   
   We can see that the test failure was caused by an `InferBound` error. @merrymercy 
   
   cc @tqchen @comaniac 
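
The last line of the log ("indexing 4 on an array of size 4") is an out-of-range index that surfaces while a mutated state is being validated. Below is a minimal Python sketch of the pattern under discussion, not TVM code: a hypothetical `infer_bound` check raises on an out-of-range iterator index, and a hypothetical `try_mutation` wrapper discards the candidate when that happens. All names and the list-based "state" are assumptions made for illustration.

```python
def infer_bound(iter_ids, num_iters):
    """Hypothetical validation pass: every referenced iterator index must exist."""
    for iter_id in iter_ids:
        if not 0 <= iter_id < num_iters:
            # Mirrors the wording of the error in the log above.
            raise IndexError(f"indexing {iter_id} on an array of size {num_iters}")
    return True


def try_mutation(candidate, num_iters):
    """Return the candidate if it validates, otherwise None (candidate is dropped)."""
    try:
        infer_bound(candidate, num_iters)
    except IndexError:
        # The candidate is invalid; skip it instead of aborting the whole search.
        return None
    return candidate


valid = try_mutation([0, 1, 3], num_iters=4)   # all indices are in range
invalid = try_mutation([0, 4], num_iters=4)    # index 4 is out of range
```

Catching only the expected exception type keeps unrelated failures visible, which is one way to soften the maintenance concern about a fully general catch.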


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-tvm] jcf94 commented on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
jcf94 commented on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-699277253


   > Thanks @jcf94 @FrozenGene
   
   Thanks.





[GitHub] [incubator-tvm] merrymercy edited a comment on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
merrymercy edited a comment on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-699566591


   This kind of general exception catch is not good for future maintenance. We should dig deeper to find out the underlying cause.
   I guess it is related to multi-threading. The mutation rules, LoopState, and InferBound all work well in the single thread case.
   So some of these components are not thread-safe.
   
   #6512 does not change any logic; it just moves the location of some functions. Can you confirm whether this is caused by #6512 or #6529?
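
One way to probe the multi-threading hypothesis is to run identical work items both serially and in a thread pool and compare the outcomes. The following is a rough sketch under that assumption; the helper names (`run_serial`, `run_parallel`, `mismatches`) are hypothetical and this is not TVM's `parallel_for`.

```python
from concurrent.futures import ThreadPoolExecutor


def _capture(fn, item):
    """Run fn(item) and record either its result or the exception type name."""
    try:
        return ("ok", fn(item))
    except Exception as exc:  # broad on purpose: we only classify outcomes here
        return ("error", type(exc).__name__)


def run_serial(fn, items):
    """Evaluate every item on the calling thread."""
    return [_capture(fn, item) for item in items]


def run_parallel(fn, items, workers=4):
    """Evaluate every item in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: _capture(fn, item), items))


def mismatches(fn, items, workers=4):
    """Indices whose serial and threaded outcomes differ."""
    serial = run_serial(fn, items)
    parallel = run_parallel(fn, items, workers)
    return [i for i, (s, p) in enumerate(zip(serial, parallel)) if s != p]


table = [10, 20, 30, 40]
outcomes = run_serial(lambda i: table[i], [0, 3, 4])
diff = mismatches(lambda i: table[i], [0, 3, 4])
```

If `mismatches` stays empty over many repetitions, the evidence for a thread-safety problem is weak; an occasional non-empty result would point toward shared mutable state in one of the components.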





[GitHub] [incubator-tvm] jcf94 edited a comment on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
jcf94 edited a comment on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-698780208


   > Did you try building a debug version of TVM and using `gdb --args python ...` to see the call stack and which code produces this error? Just using `try...catch` seems a little brute-force to me.
   
   The problem is that this is not always reproducible. The only sure thing is that the bug is caused by the `InitChangeComputeLocation()` rule.





[GitHub] [incubator-tvm] FrozenGene commented on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-698778767


   Did you try building a debug version of TVM and using `gdb --args python ...` to see the call stack and which code produces this error? Just using `try...catch` seems a little brute-force to me.





[GitHub] [incubator-tvm] comaniac merged pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
comaniac merged pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557


   





[GitHub] [incubator-tvm] merrymercy edited a comment on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
merrymercy edited a comment on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-699566591


   This kind of general exception catch is not good for future maintenance. We should dig deeper to find out the underlying cause.
   I guess it is related to multi-threading. The mutation rules, LoopState, and InferBound all work well in the single thread case.
   
   #6512 does not change any logic; it just moves the location of some functions. Can you confirm whether this is caused by #6512 or #6529?





[GitHub] [incubator-tvm] merrymercy commented on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
merrymercy commented on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-699566591


   This kind of general exception catch is not good for future maintenance. We should dig deeper to find out the underlying cause.
   I guess it is related to multi-threading. The mutation rules, LoopState, and InferBound all work well in the single thread case.
   





[GitHub] [incubator-tvm] FrozenGene commented on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
FrozenGene commented on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-698788452


   > > Did you try building a debug version of TVM and using `gdb --args python ...` to see the call stack and which code produces this error? Just using `try...catch` seems a little brute-force to me.
   > 
   > The problem is that this is not always reproducible. The only sure thing is that the bug is caused by the `InitChangeComputeLocation()` rule.
   
   One thing you could do is remove the `CHECK` inside TVM, run the script many times until the program crashes and produces a `core` file, and then debug that `core` file with `gdb`.
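
The "run it many times until it crashes" step can be automated. A minimal sketch, assuming the failing test can be launched as a command line; `run_until_failure` is a hypothetical helper, and the demo commands below merely stand in for the real test script.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=100):
    """Run cmd repeatedly; return (attempt, returncode) for the first failing run.

    Returns None if every run exits with status 0. On POSIX, a negative
    return code means the process was killed by that signal number.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            return attempt, result.returncode
    return None


# Stand-in commands: one that always fails and one that always succeeds.
always_fails = [sys.executable, "-c", "import sys; sys.exit(3)"]
always_passes = [sys.executable, "-c", "pass"]

first_failure = run_until_failure(always_fails, max_runs=5)
no_failure = run_until_failure(always_passes, max_runs=2)
```

Whether a `core` file is actually written depends on the system configuration (for example the shell's core-size limit), so that part has to be checked on the machine where the loop runs.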





[GitHub] [incubator-tvm] merrymercy edited a comment on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
merrymercy edited a comment on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-699566591


   This kind of general exception catch is not good for future maintenance. We should dig deeper to find out the underlying cause.
   The mutation rules, LoopState, and InferBound all work well in the single thread case. I think some of these components are not thread-safe.
   
   #6512 does not change any logic; it just moves the location of some functions. Can you confirm whether this is caused by #6512 or #6529?





[GitHub] [incubator-tvm] jcf94 commented on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
jcf94 commented on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-698780208


   > Did you try building a debug version of TVM and using `gdb --args python ...` to see the call stack and which code produces this error? Just using `try...catch` seems a little brute-force to me.
   
   The problem is that this is not always reproducible. The only sure thing is that the bug is caused by `InitChangeComputeLocation()`.








[GitHub] [incubator-tvm] comaniac commented on pull request #6557: [Ansor][FLAKY] Bug fix for compute at mutation error

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6557:
URL: https://github.com/apache/incubator-tvm/pull/6557#issuecomment-699030532


   Thanks @jcf94 @FrozenGene 







