Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/08/03 22:37:25 UTC

[GitHub] [incubator-mxnet] Zha0q1 opened a new issue #18855: NumPy BLAS Clashing with MXNet BLAS

Zha0q1 opened a new issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855


   Both NumPy and MXNet depend on BLAS. When they are linked against different BLAS libraries, their symbol names clash. Effectively, only the functions from NumPy's BLAS end up being used by both NumPy and MXNet.
   
   According to https://stackoverflow.com/questions/47891872/how-to-use-non-mkl-numpy-under-anaconda, Anaconda ships an MKL-backed NumPy by default. This is also the case on DLAMI 30:
   ```
   ubuntu@ip-172-31-40-81:~$ python3
   Python 3.7.7 (default, Mar 26 2020, 15:48:22) 
   [GCC 7.3.0] :: Anaconda, Inc. on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import numpy as np
   >>> np.show_config()
   blas_mkl_info:
       libraries = ['mkl_rt', 'pthread']
       library_dirs = ['/home/ubuntu/anaconda3/lib']
       define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
       include_dirs = ['/home/ubuntu/anaconda3/include']
   blas_opt_info:
       libraries = ['mkl_rt', 'pthread']
       library_dirs = ['/home/ubuntu/anaconda3/lib']
       define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
       include_dirs = ['/home/ubuntu/anaconda3/include']
   lapack_mkl_info:
       libraries = ['mkl_rt', 'pthread']
       library_dirs = ['/home/ubuntu/anaconda3/lib']
       define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
       include_dirs = ['/home/ubuntu/anaconda3/include']
   lapack_opt_info:
       libraries = ['mkl_rt', 'pthread']
       library_dirs = ['/home/ubuntu/anaconda3/lib']
       define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
       include_dirs = ['/home/ubuntu/anaconda3/include']
   >>> 
   
   ```
   
   I first ran into this issue while working on adding large tensor support to the linalg operators, where I used a manually built int64 (ILP64) OpenBLAS. I used this simple test script:
   ```
   def run_test():
       import mxnet as mx
       from mxnet import nd

       # large tensor, only works on int 64 BLAS
       A = mx.nd.ones(shape=(1, 2**31))
       nd.linalg.syrk(A)
       nd.waitall()

   if __name__ == '__main__':
       run_test()
   ```
   
   On my machine (DLAMI 30, Ubuntu 18) OpenBLAS is built with `DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 USE_OPENMP=1 INTERFACE64=1 BINARY=64 NO_SHARED=0 NO_LAPACK=0` and MXNet is built with `USE_BLAS="open" USE_INT64_TENSOR_SIZE=1`. NumPy comes pre-installed with MKL support.
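   For reference, a minimal sketch of what that setup can look like (the install prefix is a placeholder, and I am assuming the bundled FindOpenBLAS.cmake honors the `OpenBLAS_HOME` environment variable):
   ```
   # Build an ILP64 OpenBLAS with the flags quoted above
   git clone https://github.com/xianyi/OpenBLAS.git && cd OpenBLAS
   make DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 USE_OPENMP=1 INTERFACE64=1 BINARY=64 \
        NO_SHARED=0 NO_LAPACK=0
   make PREFIX=/opt/openblas-ilp64 install

   # Point the MXNet CMake build at it and enable 64-bit tensor size
   cd ~/incubator-mxnet
   OpenBLAS_HOME=/opt/openblas-ilp64 cmake -DUSE_BLAS=open -DUSE_INT64_TENSOR_SIZE=1 -B build
   cmake --build build -j"$(nproc)"
   ```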
   
   Ideally, `linalg.syrk` would invoke OpenBLAS's `cblas_ssyrk` (my build, 64-bit int), but in reality, because of the name clash, MKL's `cblas_ssyrk` (32-bit int) is called instead. This leads to:
   ```
   ubuntu@ip-172-31-40-81:~$ python test.py 
   [21:58:23] ../src/storage/storage.cc:198: Using Pooled (Naive) StorageManager for CPU
   oooof
   
   Intel MKL ERROR: Parameter 5 was incorrect on entry to cblas_ssyrk.
   ```
   
   Using GDB, we can see that we are indeed calling into MKL's `cblas_ssyrk`:
   ```
   [22:02:04] ../src/storage/storage.cc:198: Using Pooled (Naive) StorageManager for CPU
   oooof
   [Switching to Thread 0x7ffdcffff700 (LWP 22329)]
   
   Thread 6 "python3" hit Breakpoint 1, 0x00007ffff608fe50 in cblas_ssyrk_ ()
      from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mkl/../../../libmkl_rt.so
   (gdb) bt
   #0  0x00007ffff608fe50 in cblas_ssyrk_ ()
      from /home/ubuntu/anaconda3/lib/python3.7/site-packages/mkl/../../../libmkl_rt.so
   #1  0x00007fffe8b10c85 in linalg_syrk<mshadow::cpu, float> (s=<optimized out>, tA=false, beta=0, alpha=1, 
       B=..., A=...) at ../src/operator/tensor/./../linalg_impl.h:983
   #2  linalg_batch_syrk<mshadow::cpu, float> (s=<optimized out>, tA=false, beta=0, alpha=1, B=..., A=...)
       at ../src/operator/tensor/./../linalg_impl.h:985
   #3  mxnet::op::syrk::op<mshadow::cpu, float> (s=<optimized out>, tA=false, beta=0, alpha=1, B=..., A=...)
       at ../src/operator/tensor/./la_op-inl.h:340
   #4  mxnet::op::syrk::op<mshadow::cpu, float> (attrs=..., s=<optimized out>, B=..., A=...)
       at ../src/operator/tensor/./la_op-inl.h:350
   #5  mxnet::op::syrk::op<mshadow::cpu, float> (attrs=..., ctx=..., B=..., A=...)
       at ../src/operator/tensor/./la_op-inl.h:356
   #6  mxnet::op::LaOpCaller<mshadow::cpu, float, 2, 2, 1, 1, mxnet::op::syrk>::op (axis=-2, ctx=..., 
       attrs=..., outputs=..., inputs=...) at ../src/operator/tensor/./la_op.h:560
   #7  mxnet::op::LaOpForward<mshadow::cpu, 2, 2, 1, 1, mxnet::op::syrk> (attrs=..., ctx=..., inputs=..., 
       req=..., outputs=...) at ../src/operator/tensor/./la_op.h:671
   #8  0x00007fffe56ed740 in std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)>::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) const (__args#4=std::vector of length 1, capacity 1 = {...}, 
       __args#3=std::vector of length 1, capacity 1 = {...}, 
       __args#2=std::vector of length 1, capacity 1 = {...}, __args#1=..., __args#0=..., this=0x555556371c38)
       at /usr/include/c++/7/bits/std_function.h:706
   #9  mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const (
       __closure=0x555556371bb0, rctx=...) at ../src/imperative/./imperative_utils.h:494
   ```
   
   Reinstalling NumPy and linking it against my OpenBLAS build resolved the issue for me.
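   Concretely, rebuilding NumPy against the same library can look roughly like this (the prefix matches the hypothetical one above; NumPy's `site.cfg` is the numpy.distutils mechanism, so treat this as a sketch rather than exact commands):
   ```
   # Remove the MKL-backed NumPy and rebuild from source against OpenBLAS
   pip uninstall -y numpy
   git clone https://github.com/numpy/numpy.git && cd numpy
   {
     echo '[openblas]'
     echo 'libraries = openblas'
     echo 'library_dirs = /opt/openblas-ilp64/lib'
     echo 'include_dirs = /opt/openblas-ilp64/include'
     echo 'runtime_library_dirs = /opt/openblas-ilp64/lib'
   } > site.cfg
   pip install .
   # np.show_config() should now report openblas instead of mkl_rt
   python3 -c "import numpy as np; np.show_config()"
   ```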
   
   
   
   
   
   
   So the problem with this name clash is that, regardless of which BLAS we build MXNet with, we are stuck with the BLAS that NumPy is configured to use. In most cases, such as supporting large tensors (i.e. 64-bit indexing), we can configure both to use the same BLAS library, but I wonder if there is a special use case where we actually want different BLAS libraries for NumPy and MXNet?
   
   My guess would be "no", but we should still be aware of this issue, as well as of the extra step of linking NumPy and MXNet to the same BLAS, and we probably need to note this in our build tutorial.
   
   The same issue is also noted on NumPy's build-from-source page: https://numpy.org/devdocs/user/building.html. OpenBLAS supports building with function prefixes and suffixes, and NumPy can recognize suffixes like "64_" when built with 64-bit int support. We could potentially do something similar, adding a suffix/prefix to the BLAS functions and using those names in MXNet, but again it is much easier to just link NumPy and MXNet to the same BLAS.
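   For example, a suffixed OpenBLAS build and a quick check of its exported symbols would look roughly like this (INTERFACE64/SYMBOLSUFFIX are OpenBLAS make options; the resulting library name depends on the build settings):
   ```
   # Build OpenBLAS with 64-bit ints and a "64_" suffix on every exported symbol
   make INTERFACE64=1 SYMBOLSUFFIX=64_
   # The symbols no longer collide with an unsuffixed 32-bit BLAS such as MKL:
   nm -D libopenblas*.so | grep ssyrk
   #   expect cblas_ssyrk64_ / ssyrk_64_ rather than the unsuffixed names
   ```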
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668305438


   > is this from pip package or building from source? for pip, we statically link blas/lapacke and stripped all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.
   
   This is for building from source. Are we planning to support large tensors by default in the future? Currently it's controlled by a flag, and I think the distributed wheels don't have it turned on.
   (I just joined the effort to support large tensors for v1.x and 2.0.)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670233881


   BTW, int32 BLAS will still work for tensors whose total size is > INT_MAX (2**31 - 1); it's only when a single dim is > INT_MAX that we must use int64 BLAS, because the function declarations use int32 for the strides.
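   To illustrate the distinction (hypothetical shapes; each array is about 8 GiB, so this is only a sketch):
   ```
   # total size 2**31 > INT_MAX, but both dims fit in int32: 32-bit BLAS is fine
   python3 -c "import mxnet as mx; a = mx.nd.ones((2, 2**30)); print(mx.nd.linalg.syrk(a))"
   # one dim is 2**31 > INT_MAX: the int32 stride/lda arguments overflow,
   # so an int64 (ILP64) BLAS is required
   python3 -c "import mxnet as mx; a = mx.nd.ones((1, 2**31)); print(mx.nd.linalg.syrk(a))"
   ```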


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670657830


   For reference, here are the things I tried:
   
   1. build from source mxnet w/ 64 openblas no suffix + mkl numpy ==> mkl shadows 64 openblas
   2. build from source mxnet w/ 64 openblas no suffix + 64 openblas numpy ==> works; calls correct function
   3. current static build script mxnet w/ 64 openblas no suffix + mkl numpy ==> mkl shadows 64 openblas
   4. static build script mxnet w/ 64 openblas no suffix + mkl numpy + https://github.com/apache/incubator-mxnet/pull/18872 ==> TODO
   5. build from source mxnet w/64 openblas suffixed + mkl numpy + https://github.com/apache/incubator-mxnet/pull/18873 ==> works; calls correct function


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669300523


   As discussed offline, we should avoid requiring all users to build OpenBLAS from source, especially as distributions start to provide `libopenblas64_.so` versions, i.e. an OpenBLAS ILP64 build with a `64_` suffix in the symbol names. Fedora does so per https://bugzilla.redhat.com/show_bug.cgi?id=1295965. Debian testing / Ubuntu 20.04 provide `libopenblas64.so` without the suffix per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878121 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924. I initiated a discussion on https://lists.debian.org/debian-science/2020/08/threads.html about providing the same package in Debian.
   
   To avoid symbol clashes like the ones reported in these issues, we must either require the suffixed symbol names or link statically. We can work with CMake upstream on a standard approach to detecting 64-bit suffixed OpenBLAS as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html, and backport the upstream-recommended approach to our https://github.com/apache/incubator-mxnet/tree/master/cmake/upstream.
   
   To support older distributions that do not ship with `libopenblas64_.so`, we should include instructions for building `libopenblas64_.so` from source in our guide.
   
   Static linking is a great solution for the pip distribution build, but I don't think we should settle on a solution that requires static linking, because a better solution is already established as best practice (i.e. the symbol suffix), and requiring static linking would restrict the versatility of MXNet source builds and make it harder to package MXNet as part of Debian or other distributions.
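   (For anyone checking whether their distro already ships such a build, something along these lines should work; the paths and package names below are distro-specific guesses:)
   ```
   # Look for a Fedora-style suffixed ILP64 OpenBLAS
   ldconfig -p | grep openblas64
   # Inspect the suffix convention of whatever is installed
   nm -D /usr/lib64/libopenblas64_.so | grep cblas_dgemm
   ```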


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671559514


   > > One more thing is that currently the static build script defaults to dynamically linking openblas.
   > 
   > There's a step that disable/remove shared object for dynamic linking.
   
   Yes, I did that and was able to build the wheel. I think we should add that back too


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669601375


   If you want to make this work via static linking, you need to use
   
   ```
   -Bsymbolic
     When creating a shared library, bind references to global symbols to the 
     definition within the shared library, if any. Normally, it is possible 
     for a program linked against a shared library to override the definition 
     within the shared library. 
   
     This option is only meaningful on ELF platforms which support shared libraries.
   ```
   
   But instead we can just work around the issue by adopting the 64_ suffix convention.
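   (A sketch of how that flag could be passed to the libmxnet.so link via the standard CMake variable; untested, so take it as an illustration of the option rather than a recommendation:)
   ```
   # Bind BLAS symbols inside libmxnet.so locally when a static libopenblas.a
   # is linked into the shared library
   cmake -DUSE_BLAS=open -DUSE_INT64_TENSOR_SIZE=1 \
         -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-Bsymbolic" -B build
   cmake --build build -j"$(nproc)"
   ```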


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670229390


   Thanks! @szha @leezu  I just looked into adding a suffix to the cblas/lapack calls within our code base. There are ~50 places in a handful of different files that all need to be changed, which makes me wonder whether this change is too heavy. Also: 1. OpenBLAS is used in TVM, so they would need to make the same change to stay consistent with us. 2. As @leezu mentioned, finding a 64-bit OpenBLAS is not well supported, and distros don't have a unified solution for it. For our purpose of supporting large tensors in 2.0, if we could link a 64-bit OpenBLAS statically for our soon-to-come release, I would still think that's the best solution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668312227


   > The problem here is that we can never know what BLAS libraries are linked into other Python packages that our users happen to use. NumPy may be a popular one, but other packages may also link different BLAS implementations. The same issue applies to the OpenMP dependency.
   > 
   > Did you statically link BLAS implementation? You can refer to #17751. For dynamic linking, the loader will load the implementation referenced by the first package that is loaded and not load other BLAS implementations if the BLAS symbols are already available. Maybe importing mxnet first triggers importing numpy, causing the numpy BLAS implementation to be loaded?
   
   Right, it was a dynamic link. Is there a way to link it statically when building from source?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670229390


   Thanks! @szha @leezu  I just looked into adding a suffix to the cblas/lapack calls within our code base. There are ~50 places in a handful of different files that all need to be changed, which makes me wonder whether this change is too heavy. Also: 1. OpenBLAS is used in TVM, so they would need to make the same change to stay consistent with us. 2. As @leezu mentioned, finding a 64-bit OpenBLAS is not well supported, and distros don't have a unified solution for it. For our purpose of supporting large tensors in 2.0, if we could link a 64-bit OpenBLAS statically for our soon-to-come release, I would still think that's the best solution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668872058


   Alright, I tried the static build scripts under /tools.
   This is what I got:
   ```
   ubuntu@ip-172-31-12-248:~/incubator-mxnet$ ldd lib/libmxnet.so 
   	linux-vdso.so.1 (0x00007fff525c4000)
   	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb1116b7000)
   	libopenblas.so.0 => /home/ubuntu/incubator-mxnet/lib/libopenblas.so.0 (0x00007fb10efc4000)
   	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb10edbc000)
   	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb10eb9d000)
   	libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb10e96e000)
   	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb10e5e5000)
   	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb10e247000)
   	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb10e02f000)
   	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb10dc3e000)
   	/lib64/ld-linux-x86-64.so.2 (0x00007fb1192fc000)
   	libgfortran.so.4 => /home/ubuntu/incubator-mxnet/lib/libgfortran.so.4 (0x00007fb10d85f000)
   	libquadmath.so.0 => /home/ubuntu/incubator-mxnet/lib/libquadmath.so.0 (0x00007fb10d61f000)
   ```
   
   It looks like libmxnet is still dynamically dependent on libopenblas. Am I doing this wrong, or should I configure the script in some way? @leezu
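   (Once it does link statically, I'd expect the difference to show up roughly like this, using the same library path as in the ldd output above:)
   ```
   # After a successful static link, openblas should disappear from ldd output
   ldd lib/libmxnet.so | grep -i openblas          # expect no output
   # and the cblas symbols should be defined (T) inside libmxnet.so instead of
   # undefined (U) references resolved from another shared object
   nm -D lib/libmxnet.so | grep " cblas_ssyrk"
   ```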


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671559514


   > > One more thing is that currently the static build script defaults to dynamically linking openblas.
   > 
   > There's a step that disable/remove shared object for dynamic linking.
   
   Yes, I did that on my machine and was able to build the wheel. I think we should add that back too


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] access2rohit commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669596818


   > is this from pip package or building from source? for pip, we statically link blas/lapacke and stripped all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.
   
   @Zha0q1 you may want to check whether these steps are still there in the new build scripts and add them as necessary.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668296177


   The problem here is that we can never know what BLAS libraries are linked into other Python packages that our users happen to use. NumPy may be a popular one, but other packages may also link different BLAS implementations. The same issue applies to the OpenMP dependency.
   
   Did you statically link the BLAS implementation? You can refer to https://github.com/apache/incubator-mxnet/pull/17751. For dynamic linking, the loader loads the implementation referenced by the first package that is loaded and does not load other BLAS implementations if the BLAS symbols are already available. Maybe importing mxnet first triggers importing numpy, causing numpy's BLAS implementation to be loaded?
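   (If in doubt, the glibc loader can show where a symbol actually gets bound, roughly like this, using the test script from the issue description:)
   ```
   # Print symbol bindings; shows whether cblas_ssyrk resolves into
   # libmkl_rt.so or libopenblas.so for this process
   LD_DEBUG=bindings python3 test.py 2>&1 | grep cblas_ssyrk
   ```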


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670308041


   > > There are ~50 places in a handful of different files that all need to be changed. This makes me think if this change is too heavy
   > 
   > Why not follow the approach in numpy and define a macro at a central place? https://github.com/numpy/numpy/pull/15069/files#diff-4538717a0246e7d9363e76a2e3fc835e
   > 
   > > finding 64 openblas is not well supported
   > 
   > You can edit https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindOpenBLAS.cmake. You will need to edit this file in any case, even if you chose not to rely on symbol suffixes. That's because ILP64 openblas would typically be named `libopenblas64` (or `libopenblas64_` with suffixes). The mid-term approach is work with upstream so that we can eventually delete `FindOpenBLAS.cmake` and just rely on upstream cmake feature.
   > 
   > > For our purpose of supporting large tensors in 2.0, if we could link 64 openblas statically for our soon-to-come release, I would still think that’s the best solution.
   > 
   > There are two different cases: One for the staticbuild for pip, where static linkage will be preferred. For the general CMakeLists.txt, why should we restrict our users to static linkage?
   > 
   > > Also 1. openblas is used in TVM, so they will need to make the same change to be consistent with us
   > 
   > How do you ensure this consistency without symbol suffix? Does TVM support ILP64? If TVM expects standard 32bit blas but you link ILP64 blas with the same symbol names, wouldn't there be issues?
   > 
   > > BTW int 32 blas will work for tensors with size > INT_MAX (2**31 - 1 ), it's when a dim is > INT_MAX we must use int 64 blas, because in the function declarations they use int 32 for stride
   > 
   > If `dim is > INT_MAX` is supported by MXNet, our BLAS operators need to either return the correct result or raise an error. @access2rohit told me that his PR making large tensor the default would just silently return wrong result in this case.
   
   I created a simple PR. I am trying to learn about the scope of this change as I go.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670623898


   @sandeep-krishnamurthy would you take a look?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669601375


   If you want to make this work via static linking, you need to follow
   
   "If -Bsymbolic is specified, then at the time of creating a shared object ld will attempt to bind references to global symbols to definitions within the shared library. The default is to defer binding to runtime."
   
   https://stackoverflow.com/a/7202917/2560672


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669300523


   As discussed offline, we should avoid requiring all users to build OpenBLAS from source, especially as distributions start to provide `libopenblas64_.so` versions, i.e. an OpenBLAS ILP64 build with a `64_` suffix in the symbol names. Fedora does so per https://bugzilla.redhat.com/show_bug.cgi?id=1295965. Debian testing / Ubuntu 20.04 provide `libopenblas64.so` without the suffix per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878121 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924. I initiated a discussion on https://lists.debian.org/debian-science/2020/08/threads.html about providing the same package in Debian.
   
   To avoid symbol clashes like the ones reported in these issues, we must either require the suffixed symbol names or link statically. We can work with CMake upstream on a standard approach to detecting 64-bit suffixed OpenBLAS as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html, and backport the upstream-recommended approach to our https://github.com/apache/incubator-mxnet/tree/master/cmake/upstream.
   
   To support older distributions that do not ship with `libopenblas64_.so`, we should include instructions for building `libopenblas64_.so` from source in our guide.
   
   Static linking is a great solution for the pip distribution build, but I don't think we should settle on a solution that requires static linking, because a better solution is already established as best practice (i.e. the symbol suffix), and requiring static linking would restrict the versatility of MXNet source builds and make it harder to package MXNet as part of Debian or other distributions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670233881


   BTW, int32 BLAS will still work for tensors whose total size is > INT_MAX (2**31 - 1); it's just when a single dim is > INT_MAX that we must use int64 BLAS, because the function declarations use int32 for the strides.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669601375


   If you want to make this work via static linking, you may need to use
   
   ```
   -Bsymbolic
     When creating a shared library, bind references to global symbols to the 
     definition within the shared library, if any. Normally, it is possible 
     for a program linked against a shared library to override the definition 
     within the shared library. 
   
     This option is only meaningful on ELF platforms which support shared libraries.
   ```
   
   See also https://stackoverflow.com/questions/7216973/is-there-a-downside-to-using-bsymbolic-functions
   
   But instead we can just work around the issue by adopting the 64_ suffix convention.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668305438


   
   > is this from pip package or building from source? for pip, we statically link blas/lapacke and stripped all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.
   
   This is for building from source. I just joined the effort to support large tensors for v1.x and 2.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669601375


   If you want to make this work via static linking, you need to follow
   
   "If -Bsymbolic is specified, then at the time of creating a shared object ld will attempt to bind references to global symbols to definitions within the shared library. The default is to defer binding to runtime."
   
   https://stackoverflow.com/a/7202917/2560672
   
   But instead we can just work around the issue by adopting the 64_ suffix convention.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671503674


   One more thing is that currently the static build script defaults to dynamically linking openblas. We would need to patch this; I can share my changes once we have a final consensus on how to support openblas 64 in the future.
   
   Also, opencv would fail to link with this error:
   ```
   ../staticdeps/lib/libopenblas.a(sormbr.o): In function `sormbr_':
   sormbr.f:(.text+0x3a6): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x562): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x5d1): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x621): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o): In function `sormlq_':
   sormlq.f:(.text+0x30b): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o):sormlq.f:(.text+0x842): more undefined references to `_gfortran_concat_string' follow
   ../staticdeps/lib/libopenblas.a(slartg.o): In function `slartg_':
   slartg.f:(.text+0xb1): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlartg.o): In function `dlartg_':
   dlartg.f:(.text+0xb3): undefined reference to `_gfortran_pow_r8_i8'
   ../staticdeps/lib/libopenblas.a(slaed6.o): In function `slaed6_':
   slaed6.f:(.text+0x109): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlaed6.o): In function `dlaed6_':
   dlaed6.f:(.text+0x10e): undefined reference to `_gfortran_pow_r8_i8'
   collect2: error: ld returned 1 exit status
   ```
   I had to turn off opencv for the mxnet build. 
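   (The undefined `_gfortran_*` symbols live in gfortran's runtime library, so this looks like the static libopenblas.a being linked without it; the path below is a typical Ubuntu location, and the workaround is an untested guess:)
   ```
   # Confirm where the missing symbols are defined
   nm -D /usr/lib/x86_64-linux-gnu/libgfortran.so.4 | grep _gfortran_concat_string
   # A possible workaround would be to add -lgfortran -lquadmath to the link
   # flags used when OpenCV is configured against the static libopenblas.a
   ```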


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669590895


   > My solution was to have build instructions for everyone to build openblas from source
   
   It's great to have those instructions where required. But we should also look at the trend and build a solution that aligns with the work of others.
   
   > Since distros will start to provide _64 versions of libraries we don't need to worry about those then.
   
   We may need to add support for those; it's not automatic. For example, https://cmake.org/cmake/help/latest/module/FindBLAS.html does not currently distinguish the 64-bit and 32-bit versions, as the maintainers may not be aware of the use case.
   
   This doesn't all have to happen at once. We just need to ensure that our approach remains compatible with what we eventually would like to achieve.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668306079


   > is this from pip package or building from source? for pip, we statically link blas/lapacke and stripped all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.
   
   If we are also distributing int64 MXNet via pip, this is probably not an issue. Otherwise, users building from source may run into this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670250269


   > There are ~50 places in a handful of different files that all need to be changed. This makes me think if this change is too heavy
   
   Why not follow the approach in numpy and define a macro at a central place? https://github.com/numpy/numpy/pull/15069/files#diff-4538717a0246e7d9363e76a2e3fc835e
   
   > finding 64 openblas is not well supported
   
   You can edit https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindOpenBLAS.cmake. You will need to edit this file in any case, even if you choose not to rely on symbol suffixes. That's because an ILP64 OpenBLAS would typically be named `libopenblas64` (or `libopenblas64_` with suffixes). The mid-term approach is to work with upstream so that we can eventually delete `FindOpenBLAS.cmake` and just rely on the upstream CMake feature.
   
   >  For our purpose of supporting large tensors in 2.0, if we could link 64 openblas statically for our soon-to-come release, I would still think that’s the best solution.
   
   There are two different cases: one is the staticbuild for pip, where static linkage is preferred. For the general CMakeLists.txt build, why should we restrict our users to static linkage?
   
   >  Also 1. openblas is used in TVM, so they will need to make the same change to be consistent with us
   
   How do you ensure this consistency without a symbol suffix? Does TVM support ILP64? If TVM expects a standard 32-bit BLAS but you link an ILP64 BLAS with the same symbol names, wouldn't there be issues?
   
   > BTW int 32 blas will work for tensors with size > INT_MAX (2**31 - 1 ), it's when a dim is > INT_MAX we must use int 64 blas, because in the function declarations they use int 32 for stride
   
   If `dim > INT_MAX` is supported by MXNet, our BLAS operators need to either return the correct result or raise an error. @access2rohit told me that his PR making large tensors the default would just silently return a wrong result in this case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670233881


   BTW, int32 BLAS will still work for tensors whose total size is > INT_MAX (2**31 - 1); it's just when a single dim is > INT_MAX that we must use int64 BLAS, because the function declarations use int32 for the strides.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] access2rohit commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669406951


   > As discussed offline, we should avoid requiring all users to build openblas from source, especially as distributions start to provide `libopenblas64_.so` versions with OpenBLAS ILP64 build and `64_` suffix in symbol name. Fedora in https://bugzilla.redhat.com/show_bug.cgi?id=1295965. Debian testing / Ubuntu 20.04 provide `libopenblas64.so` without suffix per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878121 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924 I initiated a discussion on debian-science mailinglist about providing the same package in Debian. It's also tracked in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=967951
   > 
   > To avoid symbol version clashes as reported in these issues, we must require the suffixed symbol names or static link. We can work with cmake upstream on a standard approach to detecting 64bit suffixed openblas as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html We can backport the upstream recommended approach to our https://github.com/apache/incubator-mxnet/tree/master/cmake/upstream
   > 
   > To support older distributions that do not ship with `libopenblas64_.so`, we should include instructions for building `libopenblas64_.so` from source in our guide.
   > 
   > Static linking is a great solution for the pip distribution build, but I don't think we should settle on a solution that requires static linking, because there are already better solution established as best practice (ie. symbol suffix) and this will restrict the versatility of MXNet source build and make it harder to package MXNet as part of Debian or other distributions.
   
   @leezu Thanks for initiating the discussion on debian-science. My solution was to have build instructions for everyone to build OpenBLAS from source (static linking would avoid issues for cloud users on EC2 DLAMIs, where the pre-installed environment causes issues with the OpenBLAS installation). That way the build instructions are identical regardless of the distro selection. Most customers use pip install, so for them there isn't a difference in experience.
   
   Since distros will start to provide `_64` versions of the libraries, we won't need to worry about those.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671503674


   One more thing is that currently the static build script defaults to dynamically linking openblas. We would need to make it a static link; I can share my changes once we have a final consensus on how to support openblas 64 in the future.
   
   Also, opencv would fail to link with this error:
   ```
   ../staticdeps/lib/libopenblas.a(sormbr.o): In function `sormbr_':
   sormbr.f:(.text+0x3a6): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x562): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x5d1): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x621): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o): In function `sormlq_':
   sormlq.f:(.text+0x30b): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o):sormlq.f:(.text+0x842): more undefined references to `_gfortran_concat_string' follow
   ../staticdeps/lib/libopenblas.a(slartg.o): In function `slartg_':
   slartg.f:(.text+0xb1): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlartg.o): In function `dlartg_':
   dlartg.f:(.text+0xb3): undefined reference to `_gfortran_pow_r8_i8'
   ../staticdeps/lib/libopenblas.a(slaed6.o): In function `slaed6_':
   slaed6.f:(.text+0x109): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlaed6.o): In function `dlaed6_':
   dlaed6.f:(.text+0x10e): undefined reference to `_gfortran_pow_r8_i8'
   collect2: error: ld returned 1 exit status
   ```
   I had to turn off opencv for the mxnet build. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671503674


   One more thing is that now the static build script defaults to dynamically linking openblas. We would need to patch this; I can share my changes once we have a final consensus on how to support openblas 64 in the future.
   
   Also, opencv would fail to link with this error:
   ```
   ../staticdeps/lib/libopenblas.a(sormbr.o): In function `sormbr_':
   sormbr.f:(.text+0x3a6): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x562): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x5d1): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x621): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o): In function `sormlq_':
   sormlq.f:(.text+0x30b): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o):sormlq.f:(.text+0x842): more undefined references to `_gfortran_concat_string' follow
   ../staticdeps/lib/libopenblas.a(slartg.o): In function `slartg_':
   slartg.f:(.text+0xb1): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlartg.o): In function `dlartg_':
   dlartg.f:(.text+0xb3): undefined reference to `_gfortran_pow_r8_i8'
   ../staticdeps/lib/libopenblas.a(slaed6.o): In function `slaed6_':
   slaed6.f:(.text+0x109): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlaed6.o): In function `dlaed6_':
   dlaed6.f:(.text+0x10e): undefined reference to `_gfortran_pow_r8_i8'
   collect2: error: ld returned 1 exit status
   ```
   I had to turn off opencv for the mxnet build. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670657830


   For reference, here are the things I tried:
   
   1. build from source mxnet w/ 64 openblas no suffix + mkl numpy ==> mkl shadows 64 openblas
   2. build from source mxnet w/ 64 openblas no suffix + 64 openblas numpy ==> works; calls correct cblas symbol
   3. current static build script mxnet w/ 64 openblas no suffix + mkl numpy ==> mkl shadows 64 openblas
   4. static build script mxnet w/ 64 openblas no suffix + mkl numpy + https://github.com/apache/incubator-mxnet/pull/18872 ==> works; calls correct cblas symbol
   5. build from source mxnet w/64 openblas suffixed + mkl numpy + https://github.com/apache/incubator-mxnet/pull/18873 ==> works; calls correct cblas symbol


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669300523


   As discussed offline, we should avoid requiring all users to build OpenBLAS from source, especially as distributions start to provide `libopenblas64_.so` versions, i.e. an OpenBLAS ILP64 build with a `64_` suffix in the symbol names. Fedora does so per https://bugzilla.redhat.com/show_bug.cgi?id=1295965. Debian testing / Ubuntu 20.04 provide `libopenblas64.so` without the suffix per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878121 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924. I initiated a discussion on the debian-science mailing list about providing the same package in Debian. It's also tracked in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=967951
   
   To avoid symbol clashes like the ones reported in these issues, we must either require the suffixed symbol names or link statically. We can work with CMake upstream on a standard approach to detecting 64-bit suffixed OpenBLAS as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html, and backport the upstream-recommended approach to our https://github.com/apache/incubator-mxnet/tree/master/cmake/upstream.
   
   To support older distributions that do not ship with `libopenblas64_.so`, we should include instructions for building `libopenblas64_.so` from source in our guide.
   
   Static linking is a great solution for the pip distribution build, but I don't think we should settle on a solution that requires static linking, because a better solution is already established as best practice (i.e. the symbol suffix), and requiring static linking would restrict the versatility of MXNet source builds and make it harder to package MXNet as part of Debian or other distributions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669300523






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669300523


   As discussed offline, we should avoid requiring all users to build OpenBLAS from source, especially as distributions also provide `libopenblas64_.so` versions, i.e. an OpenBLAS ILP64 build with a `64_` suffix in the symbol names: Fedora per https://bugzilla.redhat.com/show_bug.cgi?id=1295965 and Debian testing / Ubuntu 20.04 per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878121 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924 (though Debian may not provide the `64_` suffix yet).
   
   To avoid symbol clashes like the ones reported in these issues, we must either require the suffixed symbol names or link statically. We can work with CMake upstream on a standard approach to detecting 64-bit suffixed OpenBLAS as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html, and backport the upstream-recommended approach to our https://github.com/apache/incubator-mxnet/tree/master/cmake/upstream.
   
   To support older distributions that do not ship with `libopenblas64_.so`, we can include instructions for building `libopenblas64_.so` from source in our guide.
   
   We do not want to settle on a solution that requires static linking, because a better solution is already established as best practice (i.e. the symbol suffix).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] access2rohit commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669596111


   Makes sense 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] szha commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
szha commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671542505


   > One more thing is that currently the static build script defaults to dynamically linking openblas.
   
   There's a step that disables/removes the shared object used for dynamic linking.





[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668305438


   > is this from pip package or building from source? for pip, we statically link blas/lapacke and stripped all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.
   
   This is for building from source. Are we planning to support large tensors by default in the future? Currently it's controlled by a flag.
   (I just joined the effort to support large tensors for v1.x and 2.0.)
   





[GitHub] [incubator-mxnet] leezu commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668296177


   The problem here is that we can never know what BLAS libraries are linked into other Python packages that our users happen to use. NumPy may be a popular one, but other packages may also link different BLAS implementations. The same issue applies to the OpenMP dependency.
   
   Did you statically link the BLAS implementation? For dynamic linking, the loader will load the implementation referenced by the first package that is loaded and will not load other BLAS implementations if the BLAS symbols are already available. Maybe importing mxnet first triggers importing numpy, causing the numpy BLAS implementation to be loaded?
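
   A rough, Linux-only sketch of that diagnosis, assuming MXNet was dynamically linked against its BLAS; reading /proc/self/maps and matching on library names is just one way to check:
   ```
   import numpy as np   # swap the import order to see which BLAS "wins"
   import mxnet as mx
   
   # List the BLAS/LAPACK shared objects actually mapped into this process.
   with open("/proc/self/maps") as maps:
       paths = sorted({line.split()[-1] for line in maps if "/" in line})
   
   for path in paths:
       name = path.rsplit("/", 1)[-1].lower()
       if any(tag in name for tag in ("openblas", "mkl", "blas", "lapack")):
           print(path)
   ```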





[GitHub] [incubator-mxnet] leezu commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670250269


   > There are ~50 places in a handful of different files that all need to be changed. This makes me wonder if this change is too heavy
   
   Why not follow the approach in numpy and define a macro at a central place? https://github.com/numpy/numpy/pull/15069/files#diff-4538717a0246e7d9363e76a2e3fc835e
   
   > finding 64 openblas is not well supported
   
   You can edit https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindOpenBLAS.cmake. You will need to edit this file in any case, even if you choose not to rely on symbol suffixes, because ILP64 openblas would typically be named `libopenblas64` (or `libopenblas64_` with suffixes). The mid-term approach is to work with upstream so that we can eventually delete `FindOpenBLAS.cmake` and just rely on the upstream cmake feature.
   
   >  For our purpose of supporting large tensors in 2.0, if we could link 64 openblas statically for our soon-to-come release, I would still think that’s the best solution.
   
   There are two different cases: one is the staticbuild for pip, where static linkage will be preferred. For the general CMakeLists.txt, why should we restrict our users to static linkage?
   
   >  Also 1. openblas is used in TVM, so they will need to make the same change to be consistent with us
   
   How do you ensure this consistency without a symbol suffix? Does TVM support ILP64? If TVM expects standard 32-bit blas but you link ILP64 blas, wouldn't there be issues?
   
   > BTW int 32 blas will work for tensors with size > INT_MAX (2**31 - 1); it's when a dim is > INT_MAX that we must use int 64 blas, because the function declarations use int 32 for strides
   
   If `dim > INT_MAX` is supported by MXNet, our BLAS operators need to either return the correct result or raise an error. @access2rohit told me that his PR making large tensors the default would just silently return a wrong result in this case.
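
   A small numeric sketch of that size-versus-dimension distinction (the shapes are made up for illustration): 32-bit BLAS passes each dimension and leading dimension as an int32, so what must fit is every dim, not the total element count:
   ```
   INT32_MAX = 2**31 - 1
   
   def fits_int32_blas(shape):
       # 32-bit BLAS only needs each dimension / leading dimension (stride)
       # to fit in an int32; the total element count may exceed INT32_MAX.
       return all(dim <= INT32_MAX for dim in shape)
   
   print(fits_int32_blas((2**16, 2**17)))   # True: 2**33 elements, every dim fits
   print(fits_int32_blas((2**31 + 2, 2)))   # False: one dim overflows int32
   ```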





[GitHub] [incubator-mxnet] access2rohit commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668330577


   @Zha0q1 When building the pypi wheel, openBLAS is linked statically and is inside `libmxnet.so`. So for dynamic linking we can lay out guidelines for users saying that if they want to build from source, they need to reinstall numpy with openblas built from source.
   
   We can change the build-from-source instructions for mxnet 2.0 to always build openBLAS from source and statically link it. @leezu @szha what do you think?





[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-671503674


   One more thing is that currently the static build script defaults to dynamically linking openblas. We would need to make it a static link; I can share my changes (build openblas 32 => 64; dynamic link => static link) once we have a final consensus on how to support openblas 64 in the future.
   
   Also, opencv would fail to link with this error:
   ```
   ../staticdeps/lib/libopenblas.a(sormbr.o): In function `sormbr_':
   sormbr.f:(.text+0x3a6): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x562): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x5d1): undefined reference to `_gfortran_concat_string'
   sormbr.f:(.text+0x621): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o): In function `sormlq_':
   sormlq.f:(.text+0x30b): undefined reference to `_gfortran_concat_string'
   ../staticdeps/lib/libopenblas.a(sormlq.o):sormlq.f:(.text+0x842): more undefined references to `_gfortran_concat_string' follow
   ../staticdeps/lib/libopenblas.a(slartg.o): In function `slartg_':
   slartg.f:(.text+0xb1): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlartg.o): In function `dlartg_':
   dlartg.f:(.text+0xb3): undefined reference to `_gfortran_pow_r8_i8'
   ../staticdeps/lib/libopenblas.a(slaed6.o): In function `slaed6_':
   slaed6.f:(.text+0x109): undefined reference to `_gfortran_pow_r4_i8'
   ../staticdeps/lib/libopenblas.a(dlaed6.o): In function `dlaed6_':
   dlaed6.f:(.text+0x10e): undefined reference to `_gfortran_pow_r8_i8'
   collect2: error: ld returned 1 exit status
   ```
   For my PoC build I had to turn off opencv for the mxnet build. 





[GitHub] [incubator-mxnet] szha commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
szha commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668304983


   is this from the pip package or from building from source? for pip, we statically link blas/lapacke and strip all symbols that are not exposed. this means that the blas/lapacke functions are not in the symbol table.





[GitHub] [incubator-mxnet] szha commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
szha commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669626715


   Discussed offline: we need to re-enable the symbol whitelisting in cmake builds. It previously existed in the make-based builds:
   https://github.com/apache/incubator-mxnet/blob/v1.6.x/make/config/libmxnet.ver
   https://github.com/apache/incubator-mxnet/blob/v1.6.x/make/config/libmxnet.sym
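
   A small sketch for checking whether the whitelisting took effect, i.e. whether the built `libmxnet.so` still exports BLAS/LAPACK symbols; the library path is a placeholder and `nm` from binutils is assumed to be available:
   ```
   import subprocess
   
   libmxnet = "/path/to/libmxnet.so"  # placeholder; point at the built library
   
   # Dump the dynamic symbol table; with whitelisting in place no cblas_* or
   # LAPACKE_* symbols should be exported, so they cannot clash with numpy's BLAS.
   out = subprocess.run(["nm", "-D", "--defined-only", libmxnet],
                        capture_output=True, text=True, check=True).stdout
   leaked = [line for line in out.splitlines()
             if "cblas_" in line or "LAPACKE_" in line]
   print("\n".join(leaked) if leaked else "no BLAS/LAPACK symbols exported")
   ```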
   





[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-670657830


   For reference, here are the things I tried:
   
   1. build from source mxnet w/ 64 openblas no suffix + mkl numpy ==> mkl shadows 64 openblas
   2. build from source mxnet w/ 64 openblas no suffix + 64 openblas numpy ==> works; calls correct cblas symbol
   3. current static build script mxnet w/ 64 openblas no suffix + mkl numpy ==> mkl shadows 64 openblas
   4. static build script mxnet w/ 64 openblas no suffix + mkl numpy + https://github.com/apache/incubator-mxnet/pull/18872 ==> TODO
   5. build from source mxnet w/ 64 openblas suffixed + mkl numpy + https://github.com/apache/incubator-mxnet/pull/18873 ==> works; calls correct cblas symbol





[GitHub] [incubator-mxnet] leezu commented on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-669300523


   As discussed offline, we should avoid requiring all users to build openblas from source, especially as distributions also provide `libopenblas64_.so`, an OpenBLAS ILP64 build with a `64_` suffix on the symbol names: Fedora in https://bugzilla.redhat.com/show_bug.cgi?id=1295965 and Debian testing / Ubuntu 20.04 in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910924
   
   To avoid symbol version clashes as reported in these issues, we must require the suffixed symbol names. Further, we should work with cmake upstream on a standard approach to detecting 64-bit suffixed openblas as part of https://cmake.org/cmake/help/latest/module/FindBLAS.html. To support older distributions that do not ship with `libopenblas64_.so`, we can include instructions for building `libopenblas64_.so` from source in our guide.





[GitHub] [incubator-mxnet] Zha0q1 edited a comment on issue #18855: NumPy BLAS Clashing with MXNet BLAS

Posted by GitBox <gi...@apache.org>.
Zha0q1 edited a comment on issue #18855:
URL: https://github.com/apache/incubator-mxnet/issues/18855#issuecomment-668312227


   
   
   > The problem here is that we can never know what BLAS libraries are linked into other Python packages that our users happen to use. NumPy may be a popular one, but other packages may also link different BLAS implementations. The same issue applies to the OpenMP dependency.
   > 
   > Did you statically link BLAS implementation? You can refer to #17751. For dynamic linking, the loader will load the implementation referenced by the first package that is loaded and not load other BLAS implementations if the BLAS symbols are already available. Maybe importing mxnet first triggers importing numpy, causing the numpy BLAS implementation to be loaded?
   
   Right it was a dynamic link. Is there a way to link it statically when building from source?

