Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/09/01 17:12:51 UTC
[GitHub] [incubator-mxnet] anko-intel opened a new pull request #19067: [WIP] Fix compilation for large tensor with MKL
anko-intel opened a new pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067
POC for compilation with:
-DMKL_USE_ILP64=ON -DUSE_INT64_TENSOR_SIZE=ON -DUSE_BLAS=mkl
Later I will try to set MKL_USE_ILP64 to ON when USE_INT64_TENSOR_SIZE is ON
in cmake rules.
## Description ##
Work in progress for issue #18954
It avoids the problem caused by the difference between the MKL and MXNet definitions of 64-bit integers.
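The mismatch can be illustrated with a small sketch. As far as I understand, MKL's default LP64 interface uses a 32-bit integer type while the ILP64 interface uses a 64-bit one, and a large-tensor MXNet build uses a 64-bit index type, so index buffers only pass through unconverted when the two agree. The names `SKETCH_MKL_ILP64`, `blas_int_t`, `tensor_index_t`, `IndexTypesMatch` and `SumPivots` below are hypothetical stand-ins, not identifiers from MKL or MXNet.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the build flag; set to 0 to model an LP64 BLAS.
#ifndef SKETCH_MKL_ILP64
#define SKETCH_MKL_ILP64 1
#endif

#if SKETCH_MKL_ILP64
using blas_int_t = std::int64_t;  // models a 64-bit BLAS integer (ILP64)
#else
using blas_int_t = std::int32_t;  // models a 32-bit BLAS integer (LP64)
#endif

// Models the tensor index type of a large-tensor build (assumption).
using tensor_index_t = std::int64_t;

// True when an index buffer can be handed to the BLAS layer without conversion.
constexpr bool IndexTypesMatch() {
  return std::is_same<blas_int_t, tensor_index_t>::value;
}

// Stand-in for a BLAS/LAPACK routine that takes a pointer to its own integer type.
inline blas_int_t SumPivots(const blas_int_t* ipiv, blas_int_t n) {
  blas_int_t sum = 0;
  for (blas_int_t i = 0; i < n; ++i) sum += ipiv[i];
  return sum;
}
```

With `SKETCH_MKL_ILP64` set to 0, passing a `tensor_index_t*` to `SumPivots` no longer compiles, which is the kind of error the two flags are meant to rule out when enabled together.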
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723382296
Jenkins CI successfully triggered : [windows-gpu]
[GitHub] [incubator-mxnet] Zha0q1 removed a comment on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 removed a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693572287
@mxnet-bot run ci [centos-gpu]
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693655228
Jenkins CI successfully triggered : [unix-gpu, centos-gpu]
[GitHub] [incubator-mxnet] leezu merged pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
leezu merged pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067
[GitHub] [incubator-mxnet] leezu edited a comment on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723382236
@Zha0q1 yes, that's the bug. NVCC non-deterministically produces invalid output, which causes MSVC to fail. See https://github.com/thrust/thrust/issues/1090 for more details. Unfortunately, NVIDIA does not want to fix this in CUDA 10, only in CUDA 11.
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-708557683
Jenkins CI successfully triggered : [centos-gpu]
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722802693
Jenkins CI successfully triggered : [windows-gpu]
[GitHub] [incubator-mxnet] Zha0q1 edited a comment on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723394792
Regarding the type conversion on the gpu path for operator `det` and `slogdet`, maybe we can use a kernel like this:
```c++
struct CopyArray {
  template<typename SType, typename DType>
  MSHADOW_XINLINE static void Map(size_t i, SType* src, DType* dest) {
    dest[i] = src[i];
  }
};
// ...
// in det forward
if (std::is_same<xpu, gpu>::value && !std::is_same<IndexT, int>::value) {
  using IndexInternalT = typename LapackIndex<xpu>::IndexT;
  Tensor<xpu, 2, IndexInternalT> workspace =
    ctx.requested[0].get_space_typed<xpu, 2, IndexInternalT>(pivot.shape_, s);
  linalg_batch_getrf(LU, workspace, false, s);
  Kernel<CopyArray, xpu>::Launch(s, pivot.shape_.Size(), workspace.dptr_, pivot.dptr_);
} else {
  linalg_batch_getrf(LU, pivot, false, s);
}
// ...
// in det backward
if (std::is_same<xpu, gpu>::value && !std::is_same<IndexT, int>::value) {
  using IndexInternalT = typename LapackIndex<xpu>::IndexT;
  Tensor<xpu, 2, IndexInternalT> workspace =
    ctx.requested[0].get_space_typed<xpu, 2, IndexInternalT>(pivot.shape_, s);
  Kernel<CopyArray, xpu>::Launch(s, pivot.shape_.Size(), pivot.dptr_, workspace.dptr_);
  linalg_batch_det_backward_helper(LU, workspace, det, dA, DType(0), ctx);
} else {
  linalg_batch_det_backward_helper(LU, pivot, det, dA, DType(0), ctx);
}
```
POC https://github.com/apache/incubator-mxnet/pull/19489/files
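A host-only sketch of the idea in this comment, for readers without a GPU build at hand: run the factorization on a workspace of the narrower index type, then copy element-wise into the wider pivot buffer. `CopyConvert`, `FakeGetrf` and `ForwardPivots` are hypothetical names used only for illustration; the actual proposal launches the `CopyArray` kernel on the device stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise converting copy, a host analogue of the CopyArray kernel.
template <typename SType, typename DType>
void CopyConvert(const std::vector<SType>& src, std::vector<DType>* dest) {
  dest->resize(src.size());
  for (std::size_t i = 0; i < src.size(); ++i) {
    (*dest)[i] = static_cast<DType>(src[i]);
  }
}

// Hypothetical stand-in for a routine that can only write 32-bit pivots.
inline void FakeGetrf(std::vector<int>* pivots) {
  for (std::size_t i = 0; i < pivots->size(); ++i) {
    (*pivots)[i] = static_cast<int>(i) + 1;  // placeholder 1-based values
  }
}

// Forward path: compute into an int workspace, then widen into the output.
inline std::vector<std::int64_t> ForwardPivots(std::size_t n) {
  std::vector<int> workspace(n);
  FakeGetrf(&workspace);
  std::vector<std::int64_t> pivot;
  CopyConvert(workspace, &pivot);
  return pivot;
}
```

The backward path is the mirror image: `CopyConvert` from the 64-bit pivots into an `int` workspace before calling the helper, so the input tensor itself is never modified.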
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-685008431
Hey @anko-intel , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:
- To trigger all jobs: @mxnet-bot run ci [all]
- To trigger specific jobs: @mxnet-bot run ci [job1, job2]
***
**CI supported jobs**: [unix-cpu, website, windows-gpu, sanity, edge, centos-gpu, clang, miscellaneous, windows-cpu, centos-cpu, unix-gpu]
***
_Note_:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722793082
@anko-intel could you check the windows build? Thanks!
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-729854828
PR is ready! Would you review, @leezu?
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693572287
@mxnet-bot run ci [centos-gpu]
[GitHub] [incubator-mxnet] Zha0q1 edited a comment on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723395187
```c++
template<
  typename xpu,
  typename IndexT,
  std::enable_if_t<!std::is_same<IndexT, lapack_index_t>::value, int> = 0>
inline void convert_to_int_if_needed(
  Stream<xpu> *s,
  const Tensor<xpu, 2, IndexT>& tensor) {
}
// conversion to int is required only for GPU when IndexT is equal to lapack_index_t (int64_t)
template<
  typename xpu,
  typename IndexT,
  std::enable_if_t<std::is_same<IndexT, lapack_index_t>::value, int> = 0>
inline void convert_to_int_if_needed(
  Stream<xpu> *s,
  const Tensor<xpu, 2, IndexT>& tensor) {
#ifdef __CUDACC__
  CHECK_LE(tensor.shape_[0], std::numeric_limits<int>::max())
    << "Tensor has size greater than supported.";
  CHECK_LE(tensor.shape_[1], std::numeric_limits<int>::max())
    << "Tensor has size greater than supported.";
  cudaStream_t stream = Stream<xpu>::GetStream(s);
  size_t elements = tensor.shape_.Size();
  std::vector<IndexT> vec(elements, 0);
  IndexT* ptr = vec.data();
  int* ptr_int = reinterpret_cast<int*>(vec.data());
  CUDA_CALL(cudaMemcpyAsync(ptr, reinterpret_cast<IndexT*>(tensor.dptr_),
                            tensor.MSize() * sizeof(IndexT),
                            cudaMemcpyDeviceToHost, stream));
  for (IndexT i = 0; i < elements; ++i) {
    ptr_int[i] = static_cast<int>(ptr[i]);
  }
  CUDA_CALL(cudaMemcpyAsync(tensor.dptr_, ptr,
                            tensor.MSize() * sizeof(IndexT),
                            cudaMemcpyHostToDevice, stream));
#endif
}
```
If I were to take a guess, this might be what triggers the Windows build issue.
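For context on what the quoted function does on the host side: it narrows 64-bit values to `int` inside the same buffer. A minimal host-only sketch of that in-place narrowing follows. `NarrowInPlace` and `ReadNarrow` are hypothetical names, and `std::memcpy` is used here instead of the `reinterpret_cast` to `int*` in the quoted code, purely to sidestep aliasing concerns in the sketch. Iterating forward is safe because the 4-byte write for element `i` never reaches the 8-byte slot of an element that has not been read yet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Narrows each 64-bit value to 32 bits, packing the results at the buffer start.
// Returns false (leaving the buffer partially converted) if a value does not fit.
inline bool NarrowInPlace(std::vector<std::int64_t>* buf) {
  unsigned char* bytes = reinterpret_cast<unsigned char*>(buf->data());
  for (std::size_t i = 0; i < buf->size(); ++i) {
    std::int64_t wide;
    std::memcpy(&wide, bytes + i * sizeof(wide), sizeof(wide));
    if (wide > std::numeric_limits<std::int32_t>::max() ||
        wide < std::numeric_limits<std::int32_t>::min()) {
      return false;
    }
    const std::int32_t narrow = static_cast<std::int32_t>(wide);
    std::memcpy(bytes + i * sizeof(narrow), &narrow, sizeof(narrow));
  }
  return true;
}

// Reads the i-th packed 32-bit value back out of the buffer.
inline std::int32_t ReadNarrow(const std::vector<std::int64_t>& buf, std::size_t i) {
  std::int32_t v;
  std::memcpy(&v, reinterpret_cast<const unsigned char*>(buf.data()) + i * sizeof(v),
              sizeof(v));
  return v;
}
```

Widening back to 64 bits in place would have to iterate from the last element to the first for the same reason.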
[GitHub] [incubator-mxnet] anko-intel commented on a change in pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on a change in pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#discussion_r519839681
##########
File path: src/operator/tensor/la_op-inl.h
##########
@@ -931,15 +1020,20 @@ struct det_backward {
if (dA.shape_.Size() == 0U) {
return;
}
- // compute inverse(A) and stores it to LU
- linalg_batch_det_backward_helper(LU, pivot, det, dA, DType(0), ctx);
+ Stream<xpu> *s = ctx.get_stream<xpu>();
+ convert_to_int_if_needed(s, pivot);
+ // Calculations on the GPU path are internally done on int type.
+ using IndexInternalT = typename LapackIndex<xpu>::IndexT;
+ linalg_batch_det_backward_helper(LU,
+ reinterpret_cast<const Tensor<xpu, 2, IndexInternalT>&>(pivot),
+ det, dA, DType(0), ctx);
const_cast<Tensor<xpu, 3, DType>&>(dA) = broadcast_to(reshape(det * ddet, \
Shape3(det.size(0), 1, 1)), mxnet::TShape(LU.shape_)) * \
transpose(LU, Shape3(0, 2, 1));
- Stream<xpu> *s = ctx.get_stream<xpu>();
// stop grad for zero det temporarily
Kernel<StopZeroDetGrad, xpu>::Launch(s, dA.shape_.Size(), dA.size(1) * dA.size(2), \
dA.dptr_, det.dptr_, DType(0));
+ convert_to_int64_if_needed(s, pivot);
Review comment:
You are right. I was unsure whether we are allowed to modify the memory pointed to by the input tensor. I will remove this line.
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-694950064
Jenkins CI successfully triggered : [centos-cpu]
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722804488
Unauthorized access detected.
Only the following 3 categories can trigger CI:
PR Author, MXNet Committer, Jenkins Admin.
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693724509
> @mxnet-bot run ci [centos-gpu, unix-gpu]
Looks like it was just being flaky.
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-694950003
@mxnet-bot run ci [centos-cpu]
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722679329
@mxnet-bot run ci [windows-gpu]
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-724729577
@mxnet-bot run ci [windows-gpu]
[GitHub] [incubator-mxnet] leezu commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
leezu commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723382379
@anko-intel can you try to rebase and force-push?
I'm concerned about merging this PR, as for some unknown reason it makes the nvcc bug more frequent. Maybe @josephevans can provide an ETA for the Windows CUDA 11 work, and we may be able to wait for that?
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-708557604
@mxnet-bot run ci [centos-gpu]
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693572335
Unauthorized access detected.
Only the following 3 categories can trigger CI:
PR Author, MXNet Committer, Jenkins Admin.
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-724852789
Jenkins CI successfully triggered : [windows-gpu]
[GitHub] [incubator-mxnet] leezu commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
leezu commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722802641
@Zha0q1 @anko-intel the Windows GPU failure is due to an infamous bug in CUDA 10. It is usually mitigated by retrying compilation 5 times, but either your PR was unlucky or the code change has increased the probability of hitting the CUDA compiler bug.
@josephevans is helping to update the CI to Cuda 11 to finally get rid of the bug. CC @sandeep-krishnamurthy
@mxnet-bot run ci [windows-gpu]
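The mitigation described in this comment, retrying a flaky step a bounded number of times, can be sketched generically. `RetryUpTo` is a hypothetical helper written for illustration and is not taken from the MXNet CI scripts.

```cpp
#include <cassert>
#include <functional>

// Calls `step` until it returns true or `max_attempts` calls have been made.
// Returns the number of attempts used on success, or 0 if every attempt failed.
inline int RetryUpTo(int max_attempts, const std::function<bool()>& step) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (step()) return attempt;
  }
  return 0;
}
```

A bounded retry only lowers the chance of a spurious failure; if a change makes the underlying bug more likely, all attempts can still fail, which matches what this PR ran into.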
[GitHub] [incubator-mxnet] Zha0q1 edited a comment on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693625762
@mxnet_bot run ci [centos-gpu]
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723039439
@mxnet-bot run ci [windows-gpu]
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722912739
> @Zha0q1 @anko-intel the Windows GPU failure is due to an infamous bug in CUDA 10. It is usually mitigated by retrying compilation 5 times, but either your PR was unlucky or the code change has increased the probability of hitting the CUDA compiler bug.
It's failing again. We are talking about this issue, right?
```
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_system_type<T>' being compiled
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2065: 'U1': undeclared identifier
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2923: 'std::_Select<__formal>::_Apply': 'U1' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2144: syntax error: 'unknown-type' should be preceded by ')'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2144: syntax error: 'unknown-type' should be preceded by ';'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2238: unexpected token(s) preceding ';'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2059: syntax error: ')'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2988: unrecognizable template declaration/definition
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2059: syntax error: '<end Parse>'
[2020-11-06T06:14:35.175Z] ../3rdparty/tvm/nnvm/include\nnvm/symbolic.h(73): warning C4251: 'nnvm::Symbol::outputs': class 'std::vector<nnvm::NodeEntry,std::allocator<nnvm::NodeEntr
```
that's really an interesting bug :p
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723039484
Jenkins CI successfully triggered : [windows-gpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-724729648
Jenkins CI successfully triggered : [windows-gpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] Zha0q1 commented on a change in pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on a change in pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#discussion_r518975411
##########
File path: src/operator/tensor/la_op-inl.h
##########
@@ -931,15 +1020,20 @@ struct det_backward {
if (dA.shape_.Size() == 0U) {
return;
}
- // compute inverse(A) and stores it to LU
- linalg_batch_det_backward_helper(LU, pivot, det, dA, DType(0), ctx);
+ Stream<xpu> *s = ctx.get_stream<xpu>();
+ convert_to_int_if_needed(s, pivot);
+ // Calculations on the GPU path are internally done on int type.
+ using IndexInternalT = typename LapackIndex<xpu>::IndexT;
+ linalg_batch_det_backward_helper(LU,
+ reinterpret_cast<const Tensor<xpu, 2, IndexInternalT>&>(pivot),
+ det, dA, DType(0), ctx);
const_cast<Tensor<xpu, 3, DType>&>(dA) = broadcast_to(reshape(det * ddet, \
Shape3(det.size(0), 1, 1)), mxnet::TShape(LU.shape_)) * \
transpose(LU, Shape3(0, 2, 1));
- Stream<xpu> *s = ctx.get_stream<xpu>();
// stop grad for zero det temporarily
Kernel<StopZeroDetGrad, xpu>::Launch(s, dA.shape_.Size(), dA.size(1) * dA.size(2), \
dA.dptr_, det.dptr_, DType(0));
+ convert_to_int64_if_needed(s, pivot);
Review comment:
I think the only output is dA? In that case we might not need to convert the results back to int64.
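
To make the trade-off concrete, here is a minimal sketch, assuming the pivot buffer is narrowed from int64 to int in place so that an int-only routine can read it. The helpers `narrow_in_place`, `widen_in_place` and `round_trip_demo` are hypothetical names for illustration; they are not the PR's `convert_to_int_if_needed` / `convert_to_int64_if_needed`, whose implementation is not shown in this thread. The widening pass costs a second sweep over the buffer and is only needed when something later reads the buffer as int64 again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper: pack n int64 values as n int32 values at the front
// of the same storage. Walking front to back is safe because the write for
// element i (bytes [4i, 4i+4)) never touches an int64 element that is
// still unread.
void narrow_in_place(std::vector<std::int64_t>* buf) {
  unsigned char* raw = reinterpret_cast<unsigned char*>(buf->data());
  for (std::size_t i = 0; i < buf->size(); ++i) {
    std::int64_t wide;
    std::memcpy(&wide, raw + i * sizeof(wide), sizeof(wide));
    assert(wide >= std::numeric_limits<std::int32_t>::min() &&
           wide <= std::numeric_limits<std::int32_t>::max());
    const std::int32_t narrow = static_cast<std::int32_t>(wide);
    std::memcpy(raw + i * sizeof(narrow), &narrow, sizeof(narrow));
  }
}

// Hypothetical helper: undo narrow_in_place. Walking back to front is safe
// because the write for element i (bytes [8i, 8i+8)) only covers int32
// slots that have already been read.
void widen_in_place(std::vector<std::int64_t>* buf) {
  unsigned char* raw = reinterpret_cast<unsigned char*>(buf->data());
  for (std::size_t i = buf->size(); i-- > 0;) {
    std::int32_t narrow;
    std::memcpy(&narrow, raw + i * sizeof(narrow), sizeof(narrow));
    const std::int64_t wide = narrow;
    std::memcpy(raw + i * sizeof(wide), &wide, sizeof(wide));
  }
}

// Round trip: the buffer is restored only because widen_in_place is called.
bool round_trip_demo() {
  std::vector<std::int64_t> pivot = {3, 1, 2, 0};
  const std::vector<std::int64_t> expected = pivot;
  narrow_in_place(&pivot);
  // An int32-only routine would consume the packed values here.
  widen_in_place(&pivot);  // skip this if pivot is never read as int64 again
  return pivot == expected;
}
```

If the buffer is scratch space that nothing reads after the backward pass, dropping the second call saves one pass over it; if it is shared with a later consumer, the call has to stay.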
----------------------------------------------------------------
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-724852740
@mxnet-bot run ci [windows-gpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-729802457
Jenkins CI successfully triggered : [windows-cpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
leezu commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723382290
@mxnet-bot run ci [windows-gpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] access2rohit commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
access2rohit commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-703843530
re-triggered sanity
----------------------------------------------------------------
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-724057472
Thank you @Zha0q1 for your comments. I measured performance with your kernel from commit https://github.com/apache/incubator-mxnet/pull/19489/commits/31e07c4421aaa58a4812d42e9964af7986356b45 and it gives better results than my original https://github.com/apache/incubator-mxnet/pull/19067/commits/fd624bea7f2a9d38605a275e75c6112a71cdd0d2:
![image](https://user-images.githubusercontent.com/58251767/98555244-004c9f80-22a2-11eb-822a-e40057a21150.png)
So I adopted it as the solution for dot and slogdet on the GPU path.
----------------------------------------------------------------
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693625762
@mxnet_bot run ci [centos_gpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] Zha0q1 edited a comment on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722912739
> @Zha0q1 @anko-intel the Windows GPU failure is due to an infamous bug in CUDA 10. It is usually mitigated by retrying compilation 5 times, but your PR was unlucky or the code change has increased the probability of the CUDA compiler bug.
It's failing again. We are talking about this issue, right?
```
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_system_type<T>' being compiled
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2065: 'U1': undeclared identifier
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2923: 'std::_Select<__formal>::_Apply': 'U1' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2144: syntax error: 'unknown-type' should be preceded by ')'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2144: syntax error: 'unknown-type' should be preceded by ';'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2238: unexpected token(s) preceding ';'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2059: syntax error: ')'
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2988: unrecognizable template declaration/definition
[2020-11-06T06:14:35.175Z] C:/Windows/TEMP/tmpkffgnho0/thrust-1.9.8\thrust/detail/allocator/allocator_traits.h(54): error C2059: syntax error: '<end Parse>'
[2020-11-06T06:14:35.175Z] ../3rdparty/tvm/nnvm/include\nnvm/symbolic.h(73): warning C4251: 'nnvm::Symbol::outputs': class 'std::vector<nnvm::NodeEntry,std::allocator<nnvm::NodeEntr
```
that's really an interesting bug :p
----------------------------------------------------------------
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693724530
----------------------------------------------------------------
[GitHub] [incubator-mxnet] Zha0q1 edited a comment on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722804472
> @Zha0q1 @anko-intel the Windows GPU failure is due to an infamous bug in CUDA 10. It is usually mitigated by retrying compilation 5 times, but your PR was unlucky or the code change has increased the probability of the CUDA compiler bug.
>
Ohh I see. Thanks for the clarification!
----------------------------------------------------------------
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693655181
@mxnet-bot run ci [centos-gpu, unix-gpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722679350
Unauthorized access detected.
Only following 3 categories can trigger CI :
PR Author, MXNet Committer, Jenkins Admin.
----------------------------------------------------------------
[GitHub] [incubator-mxnet] Zha0q1 commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-722804472
> @Zha0q1 @anko-intel the Windows GPU failure is due to an infamous bug in CUDA 10. It is usually mitigated by retrying compilation 5 times, but your PR was unlucky or the code change has increased the probability of the CUDA compiler bug.
>
> @josephevans is helping to update the CI to CUDA 11 to finally get rid of the bug. CC @sandeep-krishnamurthy
>
> @mxnet-bot run ci [windows-gpu]
Ohh I see. Thanks for the clarification!
----------------------------------------------------------------
[GitHub] [incubator-mxnet] leezu edited a comment on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
leezu edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-723382379
@anko-intel can you try to rebase and force-push if windows-gpu fails again?
I'm concerned about merging this PR because, for an unknown reason, it makes the nvcc bug more frequent. Maybe @josephevans can provide an ETA for the Windows CUDA 11 work and we may be able to wait for that?
----------------------------------------------------------------
[GitHub] [incubator-mxnet] anko-intel commented on pull request #19067: Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
anko-intel commented on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-729802415
@mxnet-bot run ci [windows-cpu]
----------------------------------------------------------------
[GitHub] [incubator-mxnet] Zha0q1 edited a comment on pull request #19067: [WIP] Fix compilation for large tensor with MKL
Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on pull request #19067:
URL: https://github.com/apache/incubator-mxnet/pull/19067#issuecomment-693724509
> @mxnet-bot run ci [centos-gpu, unix-gpu]
Test passed. Looks like it was just being flaky
----------------------------------------------------------------