Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/03/05 00:45:04 UTC

[GitHub] [incubator-mxnet] ptrendx opened a new pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses

ptrendx opened a new pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767
 
 
   ## Description ##
   For operators whose performance is limited by global memory bandwidth, it is important to issue the widest possible loads, as this ensures that the bandwidth is fully utilized.
   
   Currently MXNet uses vectorized loads and stores only for the `half_t` type and only in a few operators (some elementwise binary operators and `elementwise_sum`). Unfortunately, the way this was done makes assumptions about MXNet's NDArrays which do not hold true in all cases.
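
   As a rough illustration of the technique (a sketch only, not the exact kernels touched by this PR), a vectorized elementwise kernel reads and writes several `DType` elements through a single wider `LType` access:

   ```c++
   // Sketch: each thread issues one wide load and one wide store instead of
   // nvec narrow ones (e.g. LType = uint32_t moves two half values at once).
   template <typename DType, typename LType>
   __global__ void vectorized_copy(const DType* in, DType* out, int N) {
     constexpr int nvec = sizeof(LType) / sizeof(DType);
     const int tid = blockIdx.x * blockDim.x + threadIdx.x;
     if (tid * nvec >= N) return;
     // Only valid if in/out are aligned to sizeof(LType) and N is a multiple
     // of nvec: exactly the assumptions that sliced NDArrays can break.
     const LType v = reinterpret_cast<const LType*>(in)[tid];
     reinterpret_cast<LType*>(out)[tid] = v;
   }
   ```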
   
    - Failure 1:
   
   ```python
   import mxnet as mx
   ctx=mx.gpu()
   a = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)
   b = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)
   
    # views starting at element 1: the fp16 data pointers are only 2-byte aligned,
    # so the kernel's 4-byte (or wider) vectorized accesses fault
    c = a[1:3]
    d = b[1:3]
   mx.nd.elemwise_add(c, d, out=c)
   ```
   Results in error:
   ```
   Check failed: e == cudaSuccess: CUDA: misaligned address
   ```
   
    - Failure 2:
   
   ```python
   import mxnet as mx
   ctx=mx.gpu()
   a = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)
   b = mx.nd.array([1,2,3,4], dtype='float16', ctx=ctx)
   
   print(a)
    # views of length 3: not a multiple of the vector width (2 for fp16), so the
    # vectorized store also overwrites the element just past the view (a[3])
    c = a[0:3]
    d = b[0:3]
   mx.nd.elemwise_add(c, d, out=c)
   mx.nd.waitall()
   print(c)
   print(a)
   ```
   
   gives:
   ```
   [1. 2. 3. 4.]
   <NDArray 4 @gpu(0)>
   
   [2. 4. 6.]
   <NDArray 3 @gpu(0)>
   
   [2. 4. 6. 8.]
   <NDArray 4 @gpu(0)>
   ```
   
   which is silent data corruption (the last element of `a` should not have changed).
   
   This went unnoticed before because `a + b` on NDArrays launches `broadcast_add` (which is not vectorized) instead of `elemwise_add`, whereas in symbolic execution slices produce new allocations, which do not exhibit these issues.
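
   One way to make the wide stores safe for a partial tail like the one in Failure 2 (a sketch with concrete `__half`/`uint32_t` types, not the exact code from this PR) is to read the existing wide word, overwrite only the lanes that belong to the view, and write the whole word back:

   ```c++
   #include <cuda_fp16.h>
   #include <cstdint>

   // Sketch only: write the last, partial vector of a half-precision output
   // without touching the bytes that lie outside the view. Assumes dst itself
   // is 4-byte aligned; a full fix also needs to handle a misaligned start
   // (Failure 1), e.g. by falling back to narrow accesses there.
   __device__ void store_partial(const __half* src, __half* dst, int n_valid) {
     alignas(uint32_t) __half lanes[2];
     // Keep the existing value of the lane past the view.
     *reinterpret_cast<uint32_t*>(lanes) = *reinterpret_cast<const uint32_t*>(dst);
     for (int i = 0; i < n_valid && i < 2; ++i) {
       lanes[i] = src[i];  // overwrite only the valid lanes
     }
     *reinterpret_cast<uint32_t*>(dst) = *reinterpret_cast<const uint32_t*>(lanes);
   }
   ```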
   
   This PR:
    - fixes those issues
    - introduces helpers for handling vectorization (for all types, not only `half_t`)
    - increases performance of vectorized kernels
    - introduces vectorization for all binary/unary/binary with scalar ops
    - (WIP) introduces vectorization for broadcast ops
   
   @eric-haibin-lin @sxjscience @haojin2 
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [ ] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
   - [ ] Code is well-documented: 
   - For new C++ functions in header files, their functionalities and arguments are documented. 
   - [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] Properly handle vectorized loads and stores in elementwise kernels
   - [ ] Handle vectorized loads and stores in elementwise broadcast kernels

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-mxnet] sxjscience commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
sxjscience commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615472332
 
 
   @ptrendx Thanks!


[GitHub] [incubator-mxnet] ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615344611
 
 
   @leezu Have you seen such an error in the `edge` CI before?
   ```
   [2020-04-17T16:15:43.821Z] CMake Error at /usr/local/lib/python3.6/dist-packages/cmake/data/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
   [2020-04-17T16:15:43.821Z]   Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
   [2020-04-17T16:15:43.821Z] Call Stack (most recent call first):
   [2020-04-17T16:15:43.821Z]   /usr/local/lib/python3.6/dist-packages/cmake/data/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:445 (_FPHSA_FAILURE_MESSAGE)
   [2020-04-17T16:15:43.821Z]   /usr/local/lib/python3.6/dist-packages/cmake/data/share/cmake-3.17/Modules/FindOpenMP.cmake:529 (find_package_handle_standard_args)
   [2020-04-17T16:15:43.821Z]   CMakeLists.txt:404 (find_package)
   [2020-04-17T16:15:43.821Z] -- Configuring incomplete, errors occurred!
   ```



[GitHub] [incubator-mxnet] sxjscience commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
sxjscience commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615469365
 
 
   @ptrendx Sorry to ask but are these optimizations also applicable to all the numpy operators?



[GitHub] [incubator-mxnet] ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615471587
 
 
   @sxjscience Frankly, I did not look into the numpy ops in much detail, but from what I saw they only partially reuse the previous implementation, since they need to deal with potentially different types etc., which makes them use special kernels. Also, they seem to be mostly broadcasting ops (which worries me a little in the context of implementing fusion, for example, as those would not qualify). So currently I would not expect this change to impact the numpy ops, but I would like to extend this work to unify and optimize both the numpy and non-numpy versions.


[GitHub] [incubator-mxnet] ptrendx commented on issue #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-601959240
 
 
   @haojin2 I'm hitting the OoM on the Windows build again, like you did before. I started looking at the numpy versions of those functions and I see that there are many more templates there (some of which are actually not compiled on Windows), and I started thinking that we should probably just switch to runtime compilation of those kernels - there are simply too many variants here. What do you think about this (also @eric-haibin-lin @szha @leezu for comments)?
   
   Also - I don't see elementwise ops in the numpy python package, just broadcast ops. This is pretty bad, because knowing that the shapes are the same is quite important for optimizations - for example, pointwise fusion would not really work for such operators.
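
   For reference, runtime compilation would mean shipping the kernel bodies as source strings and compiling only the variants that are actually needed with NVRTC. A rough sketch of the generic NVRTC flow (not MXNet code; names and options here are illustrative):

   ```c++
   #include <nvrtc.h>
   #include <string>
   #include <vector>

   // Compile a CUDA source string to PTX at runtime. Error handling is omitted
   // for brevity; a real implementation would check every nvrtcResult and cache
   // the compiled PTX per (kernel, type, architecture) combination.
   std::string compile_to_ptx(const std::string& src, const std::string& name) {
     nvrtcProgram prog;
     nvrtcCreateProgram(&prog, src.c_str(), name.c_str(), 0, nullptr, nullptr);
     std::vector<const char*> opts = {"--std=c++11", "--gpu-architecture=compute_70"};
     nvrtcCompileProgram(prog, static_cast<int>(opts.size()), opts.data());
     size_t ptx_size = 0;
     nvrtcGetPTXSize(prog, &ptx_size);
     std::string ptx(ptx_size, '\0');
     nvrtcGetPTX(prog, &ptx[0]);
     nvrtcDestroyProgram(&prog);
     return ptx;
   }
   ```

   The resulting PTX can then be loaded through the CUDA driver API (`cuModuleLoadData`, `cuModuleGetFunction`, `cuLaunchKernel`), so only the kernel variants that a workload actually uses ever get compiled.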


[GitHub] [incubator-mxnet] leezu edited a comment on issue #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-601967994
 
 
   For the OoM, could updating the Windows toolchain (https://github.com/apache/incubator-mxnet/pull/17808) help? cc @vexilligera @josephevans



[GitHub] [incubator-mxnet] leezu edited a comment on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615386362
 
 
   The reason for mcmodel=medium not fixing the problem is that it only affects the code built by us. But we link in other code, including the C standard library's `lib/../lib64/crti.o`, which is not built for the medium memory model.
   
   So there are two blockers:
   1) compilers don't support non-standard memory models well
   2) we would need to recompile the complete stack, including the C library, with our non-standard memory model to use it
   
   In summary, it's not a feasible solution going forward, as we can't expect developers to do this on their systems.




[GitHub] [incubator-mxnet] DickJC123 merged pull request #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
DickJC123 merged pull request #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767
 
 
   


[GitHub] [incubator-mxnet] leezu edited a comment on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615380884
 
 
   It's because you modify all C compile flags with `-mcmodel=medium`. This setting is architecture dependent, and while a medium memory model exists on ARMv8, gcc does not implement it for ARMv8.
   
   I suspect cmake test-compiles a file to check whether OpenMP works, and that fails with gcc saying "sorry, but `-mcmodel=medium` isn't implemented yet".


[GitHub] [incubator-mxnet] leezu commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#discussion_r388534800
 
 

 ##########
 File path: src/common/cuda_vectorization.cuh
 ##########
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ *  Copyright (c) 2020 by Contributors
+ * \file cuda_vectorization.cuh
+ * \brief GPU helpers for vectorized memory accesses
+ */
+
+#ifndef MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+#define MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+
+#include <cuda_runtime.h>
+#include "cuda_utils.h"
+
+#if MXNET_USE_CUDA && __CUDACC__
+
+namespace mxnet {
+namespace common {
+namespace cuda {
+
+template <typename DType, typename LType>
+class VectorizedStorage {
+ public:
+  constexpr static int nvec = sizeof(LType) / sizeof(DType);
+  union vectorized_storage {
+    LType aligned;
+    DType separate[nvec];  // NOLINT(*)
+
+    MSHADOW_XINLINE vectorized_storage() {}
+    MSHADOW_XINLINE ~vectorized_storage() {}
 
 Review comment:
   True. `MSHADOW_XINLINE` additionally sets `__attribute__((always_inline))`. Do we need that for `nvcc`?
   
   With respect to `__host__ __device__`, this file is wrapped in `MXNET_USE_CUDA && __CUDACC__`, so gcc won't see it.


[GitHub] [incubator-mxnet] ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615405578
 
 
   OK, considering the 1.7 code freeze date and the fact that I really want vectorization support included there, I decided to limit this PR to just the elementwise ops and leave the broadcast ops for the next PR, which will not be part of 1.7.
   
   That should let the PR pass CI while still keeping at least some of the improvement.


[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#discussion_r388569994
 
 

 ##########
 File path: src/common/cuda_vectorization.cuh
 ##########
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ *  Copyright (c) 2020 by Contributors
+ * \file cuda_vectorization.cuh
+ * \brief GPU helpers for vectorized memory accesses
+ */
+
+#ifndef MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+#define MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+
+#include <cuda_runtime.h>
+#include "cuda_utils.h"
+
+#if MXNET_USE_CUDA && __CUDACC__
+
+namespace mxnet {
+namespace common {
+namespace cuda {
+
+template <typename DType, typename LType>
+class VectorizedStorage {
+ public:
+  constexpr static int nvec = sizeof(LType) / sizeof(DType);
+  union vectorized_storage {
+    LType aligned;
+    DType separate[nvec];  // NOLINT(*)
+
+    MSHADOW_XINLINE vectorized_storage() {}
+    MSHADOW_XINLINE ~vectorized_storage() {}
 
 Review comment:
   Frankly, I'm not sure - generally speaking `inline` should be enough, but it is only a hint and the compiler is allowed not to inline for whatever reason.


[GitHub] [incubator-mxnet] ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615384643
 
 
   @leezu Makes sense. And `mcmodel=medium` did not even fix the failure with the CMake static build in `unix-gpu` :-(. Do you have any idea why the CMake static build fails while the normal build does not?


[GitHub] [incubator-mxnet] ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615446136
 
 
   I think I will settle for the elementwise ops for now and, instead of trying to squeeze more into the already huge binary, think about ways of reducing that size in the longer term.
   
   Also, the current solution does not really touch the numpy ops much, so those would need to be tackled as well (and, judging from the current code, that will introduce much bigger bloat), so the need for a real solution is pressing.


[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#discussion_r388525308
 
 

 ##########
 File path: src/common/cuda_vectorization.cuh
 ##########
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ *  Copyright (c) 2020 by Contributors
+ * \file cuda_vectorization.cuh
+ * \brief GPU helpers for vectorized memory accesses
+ */
+
+#ifndef MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+#define MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+
+#include <cuda_runtime.h>
+#include "cuda_utils.h"
+
+#if MXNET_USE_CUDA && __CUDACC__
+
+namespace mxnet {
+namespace common {
+namespace cuda {
+
+template <typename DType, typename LType>
+class VectorizedStorage {
+ public:
+  constexpr static int nvec = sizeof(LType) / sizeof(DType);
+  union vectorized_storage {
+    LType aligned;
+    DType separate[nvec];  // NOLINT(*)
+
+    MSHADOW_XINLINE vectorized_storage() {}
+    MSHADOW_XINLINE ~vectorized_storage() {}
 
 Review comment:
   `MSHADOW_XINLINE` is useful because the CUDA-specific part of it is empty if you are not compiling with NVCC (and GCC does not understand `__host__ __device__`).
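
   For context, mshadow defines it roughly like this (paraphrased from mshadow/base.h; the exact guards in the real header may differ):

   ```c++
   // Paraphrased sketch: MSHADOW_XINLINE forces inlining and, only under nvcc,
   // adds the CUDA qualifiers so the same function can run on host and device.
   #ifdef _MSC_VER
     #define MSHADOW_FORCE_INLINE __forceinline
   #else
     #define MSHADOW_FORCE_INLINE inline __attribute__((always_inline))
   #endif
   #ifdef __CUDACC__
     #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE __device__ __host__
   #else
     #define MSHADOW_XINLINE MSHADOW_FORCE_INLINE
   #endif
   ```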


[GitHub] [incubator-mxnet] leezu commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615486891
 
 
   @mxnet-bot run ci [windows-cpu]


[GitHub] [incubator-mxnet] szha commented on issue #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
szha commented on issue #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-609962115
 
 
   The Windows CI has been updated with a 64-bit toolchain, so this PR should be unblocked now.


[GitHub] [incubator-mxnet] leezu commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu commented on a change in pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#discussion_r388471315
 
 

 ##########
 File path: src/common/cuda_vectorization.cuh
 ##########
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ *  Copyright (c) 2020 by Contributors
+ * \file cuda_vectorization.cuh
+ * \brief GPU helpers for vectorized memory accesses
+ */
+
+#ifndef MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+#define MXNET_COMMON_CUDA_VECTORIZATION_CUH_
+
+#include <cuda_runtime.h>
+#include "cuda_utils.h"
+
+#if MXNET_USE_CUDA && __CUDACC__
+
+namespace mxnet {
+namespace common {
+namespace cuda {
+
+template <typename DType, typename LType>
+class VectorizedStorage {
+ public:
+  constexpr static int nvec = sizeof(LType) / sizeof(DType);
+  union vectorized_storage {
+    LType aligned;
+    DType separate[nvec];  // NOLINT(*)
+
+    MSHADOW_XINLINE vectorized_storage() {}
+    MSHADOW_XINLINE ~vectorized_storage() {}
 
 Review comment:
   Tangential question related to `MSHADOW_XINLINE` use in this file: Do we still need `inline __attribute__((always_inline))` or is nvcc smart enough to inline? If so, can we stop using the mshadow macros and just use `__device__ __host__`?


[GitHub] [incubator-mxnet] mxnet-bot commented on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615486920
 
 
   Jenkins CI successfully triggered : [windows-cpu]


[GitHub] [incubator-mxnet] leezu edited a comment on issue #17767: Fix and optimize handling of vectorized memory accesses

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #17767: Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767#issuecomment-615419545
 
 
   @ptrendx a build of the master branch (`git clean -ffxd; python ci/build.py --platform publish.centos7_gpu_cu101 /work/runtime_functions.sh build_static_python_cu101_cmake --cache-intermediate`) will produce:
   
   ```
   -rw-r--r--  1 ubuntu ubuntu 2.4G Apr 17 18:46 libmxnet.a
   -rwxr-xr-x  1 ubuntu ubuntu 2.0G Apr 17 18:46 libmxnet.so
   ```
   
   On the other hand, with your PR
   
   ```
   -rw-r--r--  1 ubuntu ubuntu 2.8G Apr 17 19:12 libmxnet.a
   ```
   
   So there is a significant binary size increase.
   
   I have some ideas for reducing the size of `.so`, but the most immediate (and easy to backport) solution is to reduce the number of architectures in https://github.com/apache/incubator-mxnet/tree/master/config/distribution
   
   For example, instead of building for both 5.0 and 5.2, or both 7.0 and 7.5, just build for 5.0 and 7.0.
   
   MXNet would then not make use of newer features on 5.2 / 7.5 respectively, but the improvements from your PR are likely more important than the difference between 5.0 and 5.2 or 7.0 and 7.5.

