Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2022/03/10 20:27:12 UTC

[GitHub] [incubator-mxnet] timmeh87 opened a new issue #20954: Windows CPP_PACKAGE+GPU Build Fails inside python at opwrapper stage under certain conditions (using python version >= 3.8 or having certain GPU)!

timmeh87 opened a new issue #20954:
URL: https://github.com/apache/incubator-mxnet/issues/20954


   **This bug is relevant to 2.0-RC0 which I am using, but as far as I understand it, it applies to all previous versions and as people update their python version, all the older versions of the mxnet build will also break.** 
   
   ## Description
   For about a week I have been pulling my hair out trying to get mxnet to build. There are actually several bugs with building on VS2019 that prevent the build from completing, and I had to make modifications to the source code:
   
   1) replace floor() with floorf(), ceil() with ceilf(), round() with roundf() everywhere (as per a discussion on the Blender forums)
   https://devtalk.blender.org/t/cuda-compile-error-windows-10/17886/5
   2) instantiate 6 templates as per #17633
   3) sneak lapack.lib into the linker because for some reason the cmake script assumes windows does not have it and excludes it on purpose (I got my OpenBLAS and Lapack from vcpkg if that matters)
   
   I used the same flags as the windows GPU CI script, except I also enabled the USE_CPP_PACKAGE because I require it. After doing those things the libmxnet.dll is produced, but the build fails in python when generating the opwrapper
   
   Three other things that are very important about my system: 
   1) Python 3.10
   2) I have a GTX 660 installed, which is a device that supports at most mxnet_30
   3) I have CUDA 11.2 installed, because I am trying to make builds for someone else. This machine is a powerful build machine, so we don't have to do builds at home on our laptops. I am not going to use MXNet on my 660; it just happens to be there on the build machine.
   
   Ok so when the build gets to python, it fails in two different ways due to two different problems:
   1) python >= 3.8 changed the search path for DLLs. It no longer searches PATH, so loading libmxnet.dll fails because the cudart DLL it depends on cannot be found.
   
   The error message is 
   FileNotFoundError: Could not find module 'F:\incubator-mxnet\build\Debug\libmxnet.dll' (or one of its dependencies). Try using the full path with constructor syntax.
   
   The fix for this is a call to "os.add_dll_directory", which MUST be used for python versions >= 3.8
   https://docs.python.org/3/library/os.html#os.add_dll_directory
   https://issueantenna.com/repo/academysoftwarefoundation/imath/issues/238
   
   2) after loading the cuda DLLs, the code now crashes with a mysterious error about "writing 000000000", when in fact it is trying to load a version of mxnet_xx.dll which does not exist as a file on this computer. While automatically determining the version to use, it arrives at a DLL which does not exist and tries to load it anyway
   
   ### Error Message
   
   **first:**
   `FileNotFoundError: Could not find module 'F:\incubator-mxnet\build\Debug\libmxnet.dll' (or one of its dependencies). Try using the full path with constructor syntax.`
   
   **after adding cudart_xxx to the search path (this message is from python 3.10, it might vary based on your python version):**
   `OSError: exception: access violation writing 0x0000000000000000`
   
   ### Steps to reproduce
   
   - Be on windows 10 
   - Build with CPP_PACKAGE and CUDA enabled
   - Have a python version >= 3.8 (bug in mxnet)
     OR
   - Have a GPU installed that is so old the installed CUDA version does not support it (needs documentation in mxnet)
   
   ## My Solution
   The solution for cudart DLL:
   - if you just need it to work, copy the DLL from the CUDA folder to right beside libmxnet.dll
   - to fix this in the python code, the python version has to be detected and, for >= 3.8, "os.add_dll_directory" has to be called with the cuda folder
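A minimal sketch of the second option, not the actual MXNet loader code. The helper names `register_dll_dirs` and `cuda_bin_candidates` are made up for illustration, and the sketch assumes the CUDA installer has set the `CUDA_PATH` environment variable. `os.add_dll_directory` only exists on Windows with python >= 3.8, so the sketch checks for it instead of comparing version numbers:

```python
import os
import sys
from pathlib import Path


def cuda_bin_candidates(environ=os.environ):
    """Guess the CUDA bin directory from CUDA_PATH (assumption: the installer set it)."""
    cuda_path = environ.get("CUDA_PATH")
    return [str(Path(cuda_path) / "bin")] if cuda_path else []


def register_dll_dirs(dirs):
    """Add every existing directory in `dirs` to the DLL search path.

    Does nothing (returns an empty list) when not on Windows or when
    os.add_dll_directory is unavailable, i.e. python < 3.8.
    The returned handles should be kept alive while the DLLs are in use.
    """
    handles = []
    if sys.platform != "win32" or not hasattr(os, "add_dll_directory"):
        return handles
    for d in dirs:
        p = Path(d)
        if p.is_dir():
            # add_dll_directory requires an absolute path
            handles.append(os.add_dll_directory(str(p.resolve())))
    return handles
```

Something like `register_dll_dirs(cuda_bin_candidates())` would have to run before the `ctypes` call that loads libmxnet.dll.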
   
   The solution to proceed with old hardware:
   - go into "warp_dll.cpp" and change the code that detects the version of mxnet_xx.dll. Hard-code it to the version that you have built, for example 
   	'wsprintfW(dll_name, L"mxnet_52.dll", version);'
   - re-run the build, it should rebuild libmxnet.dll, then the python should execute
   - undo the change to the code, and rebuild again to go back to a normal binary that will detect the correct version of mxnet based on hardware (as long as you HAVE the hardware 🙃)
   
   ## Better solutions?
   - the solution to find cudart dll from python is already "better" and should be implemented ASAP, python 3.7 is the oldest version available on the windows store and will soon be deprecated
   - I think that it should be documented right at the top of the main CPP build instructions that having supported hardware is a hard requirement. It might seem obvious to some people, but there are situations where people like me might want to build the code even with no GPU at all. What would happen then? I don't know; someone who knows should write that into the instructions
   - Mxnet could try a little harder when it is searching for DLLs, after finding out that it is about to load a DLL that does not exist, maybe spit out an error message or something with the name of the dll and a message to get better hardware or rebuild
   - The build could pull the same screwy crap that I tried to do to enable building on any platform with a non-suitable GPU - but maybe that's complicated?
   - Mxnet could look at some configuration file when deciding what core DLL to choose and if there is an overriding value it just uses that instead of autodetect.
   
   - note: you might have been wondering "why not just build mxnet_30", and the answer is that it was deprecated in cuda 11.2, so when you try to do that, it fails and says "that is fully deprecated". I would have to uninstall CUDA 11.2 and install cuda 10. 
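The "check before loading" and "overriding value" ideas from the list above could be sketched roughly as follows. This is Python for illustration only; the real selection happens in the C++ wrapper, and `choose_core_dll` and its parameters are hypothetical names:

```python
from pathlib import Path


def choose_core_dll(lib_dir, detected_arch, override=None):
    """Pick mxnet_<arch>.dll, preferring an explicit override over autodetection.

    Raises FileNotFoundError naming the missing DLL and listing the variants
    that were actually built, instead of failing later with an access violation.
    """
    lib_dir = Path(lib_dir)
    arch = override if override is not None else detected_arch
    candidate = lib_dir / f"mxnet_{arch}.dll"
    if candidate.is_file():
        return candidate
    available = sorted(p.name for p in lib_dir.glob("mxnet_*.dll"))
    raise FileNotFoundError(
        f"{candidate.name} was selected but is not in {lib_dir}; "
        f"built variants: {available or 'none'}. "
        "Rebuild for this architecture or set an override."
    )
```

With only mxnet_52.dll built, `choose_core_dll(build_dir, 30)` would raise an error that names mxnet_30.dll, while `choose_core_dll(build_dir, 30, override=52)` would return the existing file.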
   
   ## Environment
   
   <details>
   <summary>Environment Information</summary>
   
   ```
   ----------Python Info----------
   Version      : 3.10.2
   Compiler     : MSC v.1929 64 bit (AMD64)
   Build        : ('tags/v3.10.2:a58ebcc', 'Jan 17 2022 14:12:15')
   Arch         : ('64bit', 'WindowsPE')
   ```
   
   </details>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] timmeh87 commented on issue #20954: Windows CPP_PACKAGE+GPU Build Fails inside python at opwrapper stage under certain conditions (using python version >= 3.8 or having certain GPU)!

Posted by GitBox <gi...@apache.org>.
timmeh87 commented on issue #20954:
URL: https://github.com/apache/incubator-mxnet/issues/20954#issuecomment-1064497988


   I actually found a much more elegant way to let the build proceed without messing with the C code at all.... just tell the python script to directly load one of the "mxnet_xx.dll" dlls instead of "libmxnet.dll". This can be done from the cmake script. The cuda path will come from there too. In the end only the cmake and python should be modified and it should let the build proceed on systems with insufficient GPUs. Should I make a PR?




[GitHub] [incubator-mxnet] github-actions[bot] commented on issue #20954: Windows CPP_PACKAGE+GPU Build Fails inside python at opwrapper stage under certain conditions (using python version >= 3.8 or having certain GPU)!

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #20954:
URL: https://github.com/apache/incubator-mxnet/issues/20954#issuecomment-1064473571


   Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
   Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly.
   If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on [contributing to MXNet](https://mxnet.apache.org/community/contribute) and our [development guides wiki](https://cwiki.apache.org/confluence/display/MXNET/Developments).




[GitHub] [incubator-mxnet] gregoryfdel commented on issue #20954: Windows CPP_PACKAGE+GPU Build Fails inside python at opwrapper stage under certain conditions (using python version >= 3.8 or having certain GPU)!

Posted by GitBox <gi...@apache.org>.
gregoryfdel commented on issue #20954:
URL: https://github.com/apache/incubator-mxnet/issues/20954#issuecomment-1065458400


   Adding onto this I also had similar issues building on windows.
   
   > just tell the python script to directly load one of the "mxnet_xx.dll" dlls instead of "libmxnet.dll"
   
   I just had this issue pop up, and thanks for letting me know about the fix; I also ended up rewriting `libinfo.py` to use the much more robust pathlib (https://pastebin.com/h530XpJ7), and in `base.py` I changed `libinfo.find_lib_path()` to `libinfo.find_lib_path(prefix="mxnet_75")`. It'll only work on windows, but I am leaving it as a reference.
   
   > the build fails in python when generating the opwrapper
   
   I ended up having to put all the *.dll files and their dependencies into the `cpp-package\scripts` folder to get past that error. I just wanted to add to the point that something is wrong with the way files are being searched for.
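For readers who cannot open the pastebin, here is a rough idea of what a pathlib-based lookup by prefix might look like. This is not the linked code and not MXNet's `libinfo.find_lib_path`; `find_dll_by_prefix` and its arguments are hypothetical:

```python
from pathlib import Path


def find_dll_by_prefix(prefix, search_dirs):
    """Return every existing '<prefix>.dll' in search_dirs, in search order.

    Raises RuntimeError listing the searched directories when nothing is found,
    so a wrong search path is visible immediately.
    """
    name = f"{prefix}.dll"
    found = [Path(d) / name for d in search_dirs if (Path(d) / name).is_file()]
    if not found:
        searched = ", ".join(str(Path(d)) for d in search_dirs)
        raise RuntimeError(f"Cannot find {name}; searched: {searched}")
    return found
```

Listing the searched directories in the error is the main point: it shows right away whether the build output folder was ever on the search list.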
   
   Also, as a final point of reference: this is what worked to get it to build out-of-source inside VS 2022 x64 command line with cmake/ninja using CUDA 11.6, CUDNN 8.3, and vcpkg:
   
   ```
   cmake -S "." -B "../build" -DCMAKE_TOOLCHAIN_FILE="C:/vcpkg/scripts/buildsystems/vcpkg.cmake"  -DCMAKE_INSTALL_PREFIX="D:/programs/mxnet/" -DCMAKE_BUILD_TYPE=Distribution -DCMAKE_C_COMPILER="cl.exe" -DCMAKE_CXX_COMPILER="cl.exe" -DMSVC_TOOLSET_VERSION=143 -G "Ninja" -DUSE_CPP_PACKAGE=1 -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_MKLDNN=1 -DUSE_NVRTC=1 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DBLAS=open -DUSE_DIST_KVSTORE=0 -DMXNET_BUILD_SHARED_LIBS=1 -DUSE_CXX14_IF_AVAILABLE=1
   ```

