Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/06 13:42:49 UTC

[GitHub] [incubator-mxnet] kpuatamazon opened a new issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980
 
 
   # Problem
   Not sure how much we care about MKL support, but to the extent that it still appears in the build system, operator support should be consistent.  
   
   When compiled with MKL present (MKL is found in `/opt/intel`), MXNet calls MKL for `dot` and `batch_dot` but DNNL for `fully_connected`.  These are all GEMM operators, so why is it inconsistent?  It makes Sockeye decoding 22% slower (see below) unless a workaround (also below) is applied to force the use of MKL.  
   
   This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL.  However, aa1074dc1704d3732ab205c43d48083ef8c69680 upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL: https://github.com/oneapi-src/oneDNN/commit/304915096d1def19999b963a60569ec46a882c16 .  (MKLDNN has since been renamed DNNL.)
   
   Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegation (see workaround).  
   
   # Testing 
   Tested with MXNet cfb474ba743d5ea85161bf19875488f4cb409d3c.  Compiled with mostly-default cmake settings:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ```
   
   Then when I run 
   ```
   export MKL_VERBOSE=1
   export MKLDNN_VERBOSE=1
   python3
   Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
   [GCC 8.3.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import mxnet as mx
   Numpy + Intel(R) MKL: THREADING LAYER: (null)
   Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
   Numpy + Intel(R) MKL: preloading libiomp5.so runtime
   MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
   MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
   >>> a = mx.nd.ones(shape=(2,2))
   >>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
   dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
   dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971
   
   [[2. 2.]
    [2. 2.]]
   <NDArray 2x2 @cpu(0)>
   >>> a = mx.nd.ones(shape=(2,2,2))
   >>> mx.nd.batch_dot(a,a)
   MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
   
   [[[2. 2.]
     [2. 2.]]
   
    [[2. 2.]
     [2. 2.]]]
   >>> mx.nd.dot(a,a)
   MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
   
   [[[[2. 2.]
      [2. 2.]]
   
     [[2. 2.]
      [2. 2.]]]
   
   
    [[[2. 2.]
      [2. 2.]]
   
     [[2. 2.]
      [2. 2.]]]]
   <NDArray 2x2x2x2 @cpu(0)>
   ```
   You can see DNNL is called for `FullyConnected` while MKL is called for `dot` and `batch_dot`.  
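   For anyone scripting this check instead of using the REPL, here is a minimal sketch. It assumes a Linux build with verbose logging available in both libraries; the environment variables must be set before `mxnet` is imported for them to take effect.
   ```python
   import os

   # Verbose flags must be in the environment before mxnet loads MKL/DNNL.
   os.environ["MKL_VERBOSE"] = "1"
   os.environ["MKLDNN_VERBOSE"] = "1"  # newer builds also read DNNL_VERBOSE

   import mxnet as mx

   a = mx.nd.ones(shape=(2, 2))
   mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True).wait_to_read()  # expect a dnnl_verbose line
   b = mx.nd.ones(shape=(2, 2, 2))
   mx.nd.batch_dot(b, b).wait_to_read()  # expect an MKL_VERBOSE SGEMM_BATCH line
   mx.nd.dot(b, b).wait_to_read()        # expect an MKL_VERBOSE SGEMM line
   ```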
   
   # Performance impact
   I timed Sockeye decoding.  Commit https://github.com/apache/incubator-mxnet/commit/aa1074dc1704d3732ab205c43d48083ef8c69680 made decoding 22% slower (416.878s up from 342.037s for b5d07e30321da47d604b99048c1b57c03ec819b0) even with MKL installed in `/opt/intel/`.  
   
   | Commit | Compilation | Time(s) |
   | --- | --- | --- |
   | b5d07e3 (before MKLDNN 1.0 change) | Default | 342.037 | 
   | aa1074d (MKLDNN 1.0 change) | Default | 416.878 |
   | aa1074d (MKLDNN 1.0 change) | Workaround | 343.706 |
    | cfb474ba (recent) | Default | 385.587 |
    | cfb474ba (recent) | Workaround | 312.509 |
   
   (Default compilation is `cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..`; workaround compilation is below.)
   
   # Workaround
   Since DNNL hid support for delegating to MKL, it's still possible to turn delegation back on.
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
   ```
   This compiles but then triggers a link error at runtime: `OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack`.
   So I kludged it with `export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so` and was then able to use MXNet at runtime.  There's probably a cleaner way of fixing the linkage.  
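   For anyone hitting the same `OSError`, here is a sketch that reproduces just the linkage failure outside MXNet; the library path is the one from the error above and will differ per build. ctypes dlopens with immediate symbol resolution on Linux, so an unresolvable `cblas_gemm_s8u8s32_pack` fails right away, and running under the `LD_PRELOAD` kludge makes it pass.
   ```python
   import ctypes

   try:
       # dlopen the DNNL library named in the error; unresolved symbols raise OSError.
       ctypes.CDLL("/home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1")
       print("all symbols resolved")
   except OSError as err:
       print(err)  # e.g. undefined symbol: cblas_gemm_s8u8s32_pack
   ```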
   
   # Recommended fix
   When compiled with MKL, MXNet should call MKL directly from `FullyConnected` like it already does for `dot` and `batch_dot`.  

[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609875547
 
 
   This still uses MKLDNN for `FullyConnected`:
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..
   ```
   I was able to get MKL to run by disabling MKLDNN entirely:
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..
   ```
   But then I lose other kernels like DNNL softmax, and Sockeye randomly crashes:
   ```
   mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")
   ```
   (Bizarrely, this happens after it has translated many sentences with that same code.)
   
   How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for `FullyConnected` using only documented options?  

[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612836744
 
 
   Single-threaded benchmarks (`OMP_NUM_THREADS=1`) to confirm that the difference is not down to the OMP library; times are seconds per call:
   | Shape | dot (MKL) | FullyConnected (MKL workaround) | FullyConnected (DNNL) | DNNL slowdown |
   | --- | --- | --- | --- | --- |
   | 5, 512, 512 | 0.0000914 | 0.0001016 | 0.0002448 | 2.40x |
   | 5, 512, 1536 | 0.0002939 | 0.0003051 | 0.0006576 | 2.15x |
   | 5, 512, 2048 | 0.0003828 | 0.0003919 | 0.0008582 | 2.18x |
   | 5, 2048, 512 | 0.0003681 | 0.0003785 | 0.0014638 | 3.86x |
   | 4, 512, 512 | 0.0000917 | 0.0001038 | 0.0002364 | 2.27x |
   

[GitHub] [incubator-mxnet] kpuatamazon edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612874407
 
 
   @TaoLv Regarding your comment
   > how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script. 
   
   I ran with both MKL and DNNL verbose options: `OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1`
   
   [verbose.txt](https://github.com/apache/incubator-mxnet/files/4469462/verbose.txt)
   
   To summarize the verbose output in microseconds:
   | Shape | MKL raw | DNNL raw | DNNL raw slowdown | MKL Python overhead | DNNL Python overhead |
   | --- | --- | --- | --- | --- | --- |
   | 5, 512, 512 | 55.71 | 213.19 | 3.83x | 53.19 | 61.61 | 
   | 5, 512, 1536 | 260.18 | 636.10 | 2.44x | 55.82 | 62.90 | 
   | 5, 512, 2048 | 345.66 | 844.64 | 2.44x | 56.74 | 65.66 | 
   | 5, 2048, 512 | 328.68 | 1346.32 | 4.10x | 56.62 | 187.78 | 
   | 4, 512, 512 | 53.75 | 202.27 | 3.76x | 51.45 | 60.03 | 
   
   
   Raw is the time reported by the verbose output; note I converted DNNL's milliseconds to microseconds.  
   Python overhead is the time reported by the Python script above (converted to microseconds) minus the raw time.  Note that MKL used `dot` whilst DNNL used `FullyConnected`.  
   
   In both cases, I excluded the first call for each shape to allow for JIT/burn-in.  
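   To make the columns concrete, here is the arithmetic for the first row as a sketch (values copied from the table above; dnnl_verbose reports milliseconds, hence the conversion):
   ```python
   # Shape (5, 512, 512); all times in microseconds.
   mkl_raw = 55.71             # dot kernel time from MKL_VERBOSE
   dnnl_raw = 0.21319 * 1000   # FullyConnected kernel time from dnnl_verbose, ms -> us

   slowdown = dnnl_raw / mkl_raw         # ~3.83x: the "DNNL raw slowdown" column
   mkl_python_total = mkl_raw + 53.19    # per-call time the Python script measured
   dnnl_python_total = dnnl_raw + 61.61  # likewise for FullyConnected
   ```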
   
   So we really have four issues:
   1. Raw speed as reported by MKL's verbose output is much faster than raw speed as reported by DNNL's verbose output.  This is definitely a speed problem with DNNL.  
   2. There is no documented option to compile MKLDNN's kernels with MKL's GEMM.  In my opinion, when I compile with `-DBLAS` set to something, it should use that BLAS.  You can add an `auto` option that picks according to the priorities on the website.  
   3. It's weird that `dot` calls MKL but `FullyConnected` calls DNNL given that they are both GEMMs.  I understand @pengzhao-intel is working on this.  Though unless 1 and 2 above are fixed, it will only make performance worse.  
   4. MXNet / Python overhead is larger for `FullyConnected` than for `dot`.  This is small compared to the DNNL vs MKL difference though.  We can also question why there is so much overhead relative to the cost of the multiply.  
   
   Intel should fix DNNL but that will take time.  
   
   MXNet should make `-DBLAS` actually choose the BLAS routine both to support a short-term bypass and because it better matches user expectations of what a compilation flag does.  

[GitHub] [incubator-mxnet] emfomenk commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610648623
 
 
   > I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then, right?
   
   Yeah, there should be zero difference in this case :)
   
   > > I would also suggest to try upgrading the library to the most recent version
   >
   > See my post above "I also tried the latest DNNL". Performance was unchanged.
   
   Didn't see that, thanks! Yeah, it looks like a performance bug indeed.

[GitHub] [incubator-mxnet] emfomenk commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610441396
 
 
   > Do you think that enabling this option can improve the performance of inner product primitive for the above cases?
   
   @TaoLv, yes, this could be the case. Two possible reasons:
   1. If Intel MKL is linked, then most likely `libiomp5.so` will be used instead of `libgomp.so`. So, if you see a performance difference, I would suggest testing Intel MKL-DNN v1.0 / DNNL v1.1+ with `LD_PRELOAD=libiomp5.so`, just to confirm whether or not the issue is due to the OpenMP runtime.
   2. A performance bug in Intel MKL-DNN v1.0 / DNNL v1.1+. We have definitely fixed a few performance issues in the library since v1.0.
   
   I would also suggest upgrading the library to the most recent version and checking the performance.
   I strongly suggest not using this developer option: it is unsupported, not tested, and, as I said, most likely broken in subsequent releases.

[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609843360
 
 
   Here are some benchmarks on 2fff11d4233814aa4ad07858779338090ec2132d (current tip of master) with the same Skylake c5.9xlarge.  DNNL is substantially slower than MKL.  
   
   (Which DNNL is in master?)
   
   ```python
   #!/usr/bin/env python3
   import mxnet as mx
   import time
   
   def time_procedure(shape, count, proc):
     rows, inner, cols = shape
     a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
     b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
      # Burn in: run once first so one-time costs (e.g. kernel JIT) don't skew the average
      proc(a, b, cols)
      mx.nd.waitall()  # MXNet ops are asynchronous; block until all work finishes
     begin = time.time()
     for i in range(0, count):
       proc(a, b, cols)
       mx.nd.waitall()
     return (time.time() - begin) / count
   
   shapes = [(5, 512, 512), (5,512,1536), (5,512,2048), (5,2048,512), (4,512,512)]
   count = 1000
   
   procedures = {
     "fullyconnected (DNNL)" : (lambda a, b, cols : mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
     "dot (MKL)" : (lambda a, b, cols : mx.nd.dot(a, b, transpose_b = True))
   }
   for s in shapes:
     print("Shape " + str(s))
     stats = {}
     for name, l in procedures.items():
       stats[name] = time_procedure(s, count, l)
       print("{:.7f} seconds for {}".format(stats[name], name))
   ```
   Run as `OMP_NUM_THREADS=4 ./mult_bench.py`:
   ```
   Shape (5, 512, 512)
   0.0000961 seconds for fullyconnected (DNNL)
   0.0000509 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0002011 seconds for fullyconnected (DNNL)
   0.0000735 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0002521 seconds for fullyconnected (DNNL)
   0.0001027 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0003569 seconds for fullyconnected (DNNL)
   0.0001018 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.0000946 seconds for fullyconnected (DNNL)
   0.0000496 seconds for dot (MKL)
   ```
   
   I don't really mind what the default BLAS implementation is.  But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.  

[GitHub] [incubator-mxnet] TaoLv commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609885325
 
 
   Yes, `USE_MKLDNN` is ON by default on Linux. See https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L44. If you want to disable DNNL optimizations, you need to set `USE_MKLDNN` to OFF explicitly on the cmake command line.
   
   > How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?
   
   I don't think it's possible at this moment. I think the real question is how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of the DNNL primitive kernel is good, compared with the output of the script. Take the last shape as an example (on my machine, not EC2):
   ```
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0490723
   0.0001380 seconds for fullyconnected (DNNL)
   0.0000556 seconds for dot (MKL)
   ```
   
   You can set DNNL_VERBOSE=1 when running the script to get the verbose output.
   
   

[GitHub] [incubator-mxnet] kpu commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610643813
 
 
   > If Intel MKL is linked
   
   MKL was linked in both cases, and in fact called in both cases from `dot`, just not from `FullyConnected`.
   
   > libiomp5.so will be used instead of libgomp.so
   
   I tried single-threaded (`OMP_NUM_THREADS=1`) and still saw a performance drop.  There shouldn't be much difference between OMP libraries then, right?  
   
   > I would also suggest to try upgrading the library to the most recent version
   
   See my post above "I also tried the latest DNNL".  Performance was unchanged.  

[GitHub] [incubator-mxnet] kpuatamazon edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609906743
 
 
   > I don't think it's possible at this moment. 
   
   Can we please have a build option to force MKL usage for all GEMM calls?  This is a pretty big performance regression.  
   
   > I think the real question is how to improve the performance of FullyConnected when DNNL is used.
   
   Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right?  If the MXNet wrapper is the problem, then why is delegating to MKL much faster?  
   
   Applying my workaround:
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
   ninja -j 30
   export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
   export OMP_NUM_THREADS=4
   ./mult_bench.py #NB I edited the script to label things more clearly.  
   ```
   Output:
   ```
   Shape (5, 512, 512)
   0.0000628 seconds for fullyconnected (MKL workaround)
   0.0000488 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0000816 seconds for fullyconnected (MKL workaround)
   0.0000661 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0001109 seconds for fullyconnected (MKL workaround)
   0.0000957 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0001078 seconds for fullyconnected (MKL workaround)
   0.0000954 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.0000501 seconds for fullyconnected (MKL workaround)
   0.0000434 seconds for dot (MKL)
   ```
   There is something less efficient about `FullyConnected` than `dot`, but it's nowhere near enough to explain DNNL's slowdown; see the arithmetic below.  
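   Rough arithmetic on the first shape (a sketch using the `OMP_NUM_THREADS=4` numbers above and from my earlier benchmark on the same box) to split the wrapper cost from the kernel cost:
   ```python
   # Shape (5, 512, 512); microseconds per call with OMP_NUM_THREADS=4.
   dot_mkl = 48.8   # dot (MKL), from the output above
   fc_mkl = 62.8    # FullyConnected with the MKL workaround, from the output above
   fc_dnnl = 96.1   # FullyConnected with stock DNNL, from the earlier benchmark

   wrapper_cost = fc_mkl - dot_mkl  # ~14 us: extra cost of the FullyConnected path
   kernel_cost = fc_dnnl - fc_mkl   # ~33 us: extra cost of DNNL's GEMM over MKL's
   ```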
   
   FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower responses on other days.  

[GitHub] [incubator-mxnet] TaoLv commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610171165
 
 
   @emfomenk Thanks for the explanation. Do you think that enabling this option can improve the performance of inner product primitive for the above cases?

[GitHub] [incubator-mxnet] pengzhao-intel commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610395459
 
 
   @bartekkuncer, please help take a look at this issue as well.

[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-613009431
 
 
   In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for `LD_PRELOAD`.  Just do this before running cmake the first time:
   ```bash
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   ```
   Then `cmake` can be run normally:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ```
   and the compiled MXNet can be run normally without any special environment variables.  
   
   To be clear, the above kludge is an undocumented abomination.  The problem with `cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include` is that it links against MKL twice.  DNNL is hard-coded to link dynamically against `libmkl_rt.so`:
   https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64
   Then MXNet also links statically against MKL, and DNNL still wants the shared library via `dlopen`.  So we should just leave the shared library out.  How do we do that?  Implement directly in `CXXFLAGS` what the rest of DNNL's build system would do: https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67
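   A quick sanity check (a sketch, Linux-only) that only the intended MKL and DNNL objects end up mapped into the process after this change:
   ```python
   import mxnet  # noqa: F401  (import first so libmxnet.so and its dependencies are mapped)

   # List every MKL- or DNNL-related shared object in this process's address space.
   with open("/proc/self/maps") as maps:
       paths = {line.split()[-1] for line in maps if line.strip() and "/" in line.split()[-1]}
   print(sorted(p for p in paths if "mkl" in p or "dnnl" in p))
   ```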

[GitHub] [incubator-mxnet] TaoLv commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609820709
 
 
   Thanks for raising the issue, @kpuatamazon. We plan to optimize the dot and batch_dot operators with the DNNL MatMul primitive. But please note that the performance of the MatMul primitive was not optimized until v1.3, which was released last week. That's why we didn't integrate the primitive when it was first introduced in v1.2.
   
   > When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.
   
   As mentioned above, dot and batch_dot will also be optimized with DNNL. And as DNNL is more dedicated to deep learning, we will consider DNNL with high priority when both DNNL and MKL are enabled at compile time.
   
   Please provide a simple reproducer if you find that FullyConnected based on DNNL is slower than FullyConnected based on MKL BLAS.
   

[GitHub] [incubator-mxnet] emfomenk commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610158825
 
 
   @TaoLv, it is a developer option to switch back to Intel MKL GEMM (instead of the jitted one). I would suggest avoiding it, as it is an internal option (marked with a leading underscore). I think it was even broken somewhere between v1.1 and v1.3...

[GitHub] [incubator-mxnet] TaoLv commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609860069
 
 
   Thank you @kpuatamazon. I will try to reproduce the performance issue and get back to you later.
   
   > (Which DNNL is in master?)
   
   It should be v1.2.2 but was downgraded to v1.1.2 by https://github.com/apache/incubator-mxnet/pull/17084 (maybe by accident).
   
   > But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.
   
   Not sure I understand what you mean. But using MKL BLAS is documented in:
   - makefile: https://github.com/apache/incubator-mxnet/blob/master/make/config.mk#L120
   - cmake: https://github.com/apache/incubator-mxnet/blob/master/cmake/ChooseBlas.cmake#L19
   - website: https://mxnet.incubator.apache.org/get_started/build_from_source#math-library-selection 

[GitHub] [incubator-mxnet] pengzhao-intel commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610706449
 
 
   Thanks @emfomenk & @TaoLv for identifying the issue.  Our team will follow up on it and switch more ops to DNNL soon.  

[GitHub] [incubator-mxnet] emfomenk commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612837674
 
 
   Summoning @aaraujom

[GitHub] [incubator-mxnet] TaoLv commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-610157462
 
 
   @emfomenk What's `-D_DNNL_USE_MKL=FULL` for? I'm not aware of this option and guess it's not a user-facing option.

[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609859032
 
 
   I also tried the latest DNNL:
   ```bash
   cd 3rdparty
   rm -rf mkldnn
   git clone https://github.com/oneapi-src/oneDNN mkldnn
   cd ../build && ninja -j 32
   ```
   and DNNL is still slower:
   ```
   Shape (5, 512, 512)
   0.0000875 seconds for fullyconnected (DNNL)
   0.0000507 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0002005 seconds for fullyconnected (DNNL)
   0.0000729 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0002516 seconds for fullyconnected (DNNL)
   0.0000974 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0003564 seconds for fullyconnected (DNNL)
   0.0000981 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.0000928 seconds for fullyconnected (DNNL)
   0.0000497 seconds for dot (MKL)
   ```

[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629385677


   > Tested with MXNet [cfb474b](https://github.com/apache/incubator-mxnet/commit/cfb474ba743d5ea85161bf19875488f4cb409d3c). Compiled with mostly-default cmake settings:
   > 
   > ```shell
   > cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   > ```
   > 
   > Then when I run
   > 
   > ```
   > export MKL_VERBOSE=1
   > export MKLDNN_VERBOSE=1
   > python3
   > Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
   > [GCC 8.3.0] on linux
   > Type "help", "copyright", "credits" or "license" for more information.
   > >>> import mxnet as mx
   > Numpy + Intel(R) MKL: THREADING LAYER: (null)
   > Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
   > Numpy + Intel(R) MKL: preloading libiomp5.so runtime
   > ```
   
   Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with the default cmake config doesn't use MKL as BLAS, so we can't get the exports above.
   
   Thus, for an Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel and update the cmake command to
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   I found this uses MKL as BLAS, and `export MKL_VERBOSE=1` confirms it.
   
   With this addition to both [default & workaround], I reran opperf and didn't see much performance difference.
   
   Default
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   
   Workaround
   ```
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   
   Results
   
   | Operator | LHS | RHS | MKL Default | MKL Workaround |
   | --- | --- | --- | --- | --- |
   | Dot | (4, 512, 512) | (4, 512, 512) | 4.1112 | 4.8241 |
   | | (5, 512, 512) | (5, 512, 512) | 6.4421 | 7.607 |
   | | (5, 512, 1536) | (5, 512, 1536) | 20.3648 | 19.2217 |
   | | (5, 512, 2048) | (5, 512, 2048) | 23.3236 | 23.2849 |
   | | (5, 2048, 512) | (5, 2048, 512) | 123.1235 | 123.9806 |
   | Batch_dot | (4, 512, 512) | (4, 512, 512) | 1.4105 | 1.407 |
   | | (5, 512, 512) | (5, 512, 512) | 1.7558 | 1.7511 |
   | | (5, 512, 1536) | (5, 512, 1536) | 6.5931 | 6.5585 |
   | | (5, 512, 2048) | (5, 512, 2048) | 9.1452 | 9.1031 |
   | | (5, 2048, 512) | (5, 2048, 512) | 29.0192 | 28.9236 |
   
   | Operator | Data | Weight | MKL Default | MKL Workaround |
   | --- | --- | --- | --- | --- |
   | FC | (4, 512) | (512, 512) | 0.057 | 0.0685 |
   | | (5, 512) | (512, 512) | 0.0591 | 0.0698 |
   | | (5, 512) | (1536, 512) | 0.0823 | 0.0939 |
   | | (5, 512) | (2048, 512) | 0.0916 | 0.1026 |
   | | (5, 2048) | (512, 2048) | 0.1146 | 0.1267 |


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] leezu commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-623736430


   Does DNNL always link to MKL by default if MKL is available on the system, just as MXNet does link to MKL in such situation?
   
   Further, regarding the double-linking issue, https://github.com/apache/incubator-mxnet/pull/17794 should fix it.


[GitHub] [incubator-mxnet] aaraujom commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
aaraujom commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-637876815


   Hi @ChaiBapchya - I was able to reproduce the issue on my end. DNNL is missing some kernels for the very specific problem sizes behind the GEMMs in this thread. We will address this issue in DNNL.
   
   


[GitHub] [incubator-mxnet] aaraujom commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
aaraujom commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629542494


   Hi, thanks for checking. Let me try to reproduce this on my end and fix it.


[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
kpuatamazon commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-674946425


   Mostly fixed by the v1.6.x release of oneDNN.  There is still a speed discrepancy but it's not as bad.  
   
   `git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41` (current v1.x branch of MXNet including oneDNN 1.6.1)
   ```
   Shape (5, 512, 512)
   0.0000471 seconds for fullyconnected (DNNL)
   0.0000413 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0000691 seconds for fullyconnected (DNNL)
   0.0000639 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0000936 seconds for fullyconnected (DNNL)
   0.0000868 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0000949 seconds for fullyconnected (DNNL)
   0.0000870 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.0000446 seconds for fullyconnected (DNNL)
   0.0000392 seconds for dot (MKL)
   ```
   
   Reverting to oneDNN v1.3:
   `git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 && git revert 17111037176294d86b7dc6f719347aa24c96b317 && git revert eae6171cb56190cf06f5b11597380c082745991b`
   ```
   Shape (5, 512, 512)
   0.0000842 seconds for fullyconnected (DNNL)
   0.0000407 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0001908 seconds for fullyconnected (DNNL)
   0.0000606 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0002421 seconds for fullyconnected (DNNL)
   0.0000880 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0003462 seconds for fullyconnected (DNNL)
   0.0000912 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.0000841 seconds for fullyconnected (DNNL)
   0.0000395 seconds for dot (MKL)
   ```
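
   For reference, here is a minimal sketch of the kind of harness that could produce timings like these. Two assumptions on my part, since the exact script isn't shown here: each triple (m, k, n) is an (m, k) x (k, n) product, and `FullyConnected` takes its weight in the (num_hidden, k) layout:
   ```python
   import time
   import mxnet as mx

   def avg_seconds(fn, repeat=1000):
       fn().wait_to_read()                          # warm-up run
       start = time.perf_counter()
       for _ in range(repeat):
           out = fn()
       out.wait_to_read()                           # wait for the async queue to drain
       return (time.perf_counter() - start) / repeat

   for m, k, n in [(5, 512, 512), (5, 512, 1536), (5, 512, 2048), (5, 2048, 512), (4, 512, 512)]:
       data = mx.nd.random.uniform(shape=(m, k))
       weight = mx.nd.random.uniform(shape=(n, k))  # FC weight layout is (num_hidden, k)
       rhs = mx.nd.random.uniform(shape=(k, n))
       fc = avg_seconds(lambda: mx.nd.FullyConnected(data, weight, num_hidden=n, no_bias=True))
       dt = avg_seconds(lambda: mx.nd.dot(data, rhs))
       print("Shape (%d, %d, %d)" % (m, k, n))
       print("%.7f seconds for fullyconnected" % fc)
       print("%.7f seconds for dot" % dt)
   ```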


[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
kpuatamazon commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-626563301


   @ChaiBapchya to be clear, here's how I am building the second option:
   ```bash
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   unset LD_PRELOAD  #Technically this should be what exists in your environment by default
   rm -rf build
   mkdir build
   cd build
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ninja -j 30
   ```
   Note that `cmake` does not appear to pick up changes to `CXXFLAGS` on reconfigure, so it needs a fresh build directory (deleting the cache might work).  
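
   If recreating the directory is too heavy-handed, deleting just the cached configuration before rerunning `cmake` may be enough (an untested assumption on my part):
   ```bash
   rm build/CMakeCache.txt    # possibly also: rm -rf build/CMakeFiles
   cd build
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ```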


[GitHub] [incubator-mxnet] pengzhao-intel commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
pengzhao-intel commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-703353834


   > @pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization right?
   > https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty
   > This means public mxnet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized mkldnn [by virtue of v1.x branch having mkldnn v1.6+ prior to 1.8.rc0] right?
   
   Thanks for raising the question. The China team is on a public holiday this week. 
   @anko-intel could you help take a look at the issue?
   
   


[GitHub] [incubator-mxnet] pengzhao-intel commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
pengzhao-intel commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-683539078


   Closing since the fix is included with the updated MKLDNN.


[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629526785


   Can confirm that this issue is specific to AVX512 kernels.
   Tried this on a c5.xl.
   $ lscpu
   ```
   Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
   ```
   ## Results
   ### Default [slower]
   ```
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812
   ```
   ```
   [{'FullyConnected': [
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101, 'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected': 0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364, 'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected': 0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346, 'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected': 0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465, 'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected': 0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984, 'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected': 0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]
   ```
   ### MKL Workaround [Faster]
   ```
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094
   ```
   ```
   [{'FullyConnected': [
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375, 'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected': 0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756, 'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected': 0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872, 'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected': 0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578, 'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected': 0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994, 'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected': 0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]
   ```
   
   To reproduce
   https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b
   https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py


[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
kpuatamazon commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-623737885


   DNNL has only a hidden unsupported option to link to MKL.  It will not link to MKL by default.  


[GitHub] [incubator-mxnet] b-shi commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
b-shi commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-675738532


   Hi, we benchmarked gemm performance for those sizes internally using oneDNN vs oneMKL directly and here is what we got:
   
   | Max (20K trials) GFlops | oneMKL (1Th) | oneDNN (1Th) | oneMKL (4Th) | oneDNN (4Th) | oneMKL (8Th) | oneDNN (8Th) | oneMKL (28Th) | oneDNN (28Th) |
   |-------------------------|--------------|--------------|--------------|--------------|--------------|--------------|---------------|---------------|
   | 5, 512, 512               |        34.49 |        40.96 |       174.76 |       262.14 |       291.27 |       436.91 |        374.49 |        524.29 |
   | 5, 512, 1536              |        26.57 |        27.40 |       178.73 |       231.30 |       357.47 |       491.52 |        561.74 |        873.81 |
   | 5, 512, 2048              |        25.83 |        26.55 |       143.64 |       174.76 |       338.25 |       436.91 |        616.81 |       1048.58 |
   | 5, 2048, 512              |        25.51 |        26.82 |       139.81 |       183.96 |       388.36 |       524.29 |        873.81 |       1310.72 |
   | 4, 512, 512               |        31.78 |        42.80 |       161.32 |       233.02 |       262.14 |       419.43 |        299.59 |        419.43 |
   
   | Avg (20K trials) GFlops | oneMKL (1Th) | oneDNN (1Th) | oneMKL (4Th) | oneDNN (4Th) | oneMKL (8Th) | oneDNN (8Th) | oneMKL (28Th) | oneDNN (28Th) |
   |-------------------------|--------------|--------------|--------------|--------------|--------------|--------------|---------------|---------------|
   | 5, 512, 512               |        33.96 |        39.45 |       169.05 |       233.88 |       258.32 |       364.93 |        328.68 |        399.95 |
   | 5, 512, 1536              |        26.30 |        27.13 |       174.04 |       226.75 |       342.51 |       463.81 |        492.14 |        765.09 |
   | 5, 512, 2048              |        25.58 |        26.35 |       140.57 |       171.42 |       328.70 |       429.25 |        589.81 |        873.36 |
   | 5, 2048, 512              |        25.27 |        26.57 |       137.35 |       179.91 |       368.89 |       508.38 |        792.29 |       1101.67 |
   | 4, 512, 512               |        31.02 |        41.91 |       146.59 |       216.20 |       226.68 |       335.54 |        271.98 |        334.98 |
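
   For context, these GFlops figures are presumably 2*m*k*n floating-point operations per GEMM divided by wall time (an assumption; the methodology isn't stated). A quick sanity check against the 21.25us SGEMM in the verbose log earlier in the thread:
   ```python
   def gflops(m, k, n, seconds):
       # An (m, k) x (k, n) GEMM performs 2*m*k*n floating-point operations.
       return 2.0 * m * k * n / seconds / 1e9

   # The (5, 2048) x (2048, 512) SGEMM logged at 21.25us works out to ~493 GFlops,
   # the same order of magnitude as the oneMKL columns above.
   print(gflops(5, 2048, 512, 21.25e-6))
   ```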
   
   


[GitHub] [incubator-mxnet] aaraujom commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
aaraujom commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-674996470


   @ChaiBapchya - Will those improvements work for you?


[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-703318878


   @pengzhao-intel v1.7.0 public mxnet uses MKLDNN v1.3 and hence won't have this optimization, right?
   https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty
   This means public mxnet v1.7 won't have this fix; v1.8, by contrast, would have the optimized mkldnn [by virtue of the v1.x branch having mkldnn v1.6+ prior to 1.8.rc0], right?


[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629477511


   Yes, the logs agree with these statements, but the perf difference isn't visible via opperf.
   My bad - LHS and RHS were wrongly interpreted, but that's for dot and batch_dot; the FC values for data and weight are the same as yours.
   
   CPU : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
   Instance p3.8xl
   
   This instance doesn't have avx512
   ```
   $ lscpu | grep Flags | grep avx
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
   ``` 
   
   The OMP threading I used was the default. 
   
   However, trying FC with OMP_NUM_THREADS=4 still shows that the workaround makes FC slower [despite it using MKL] than the default [which doesn't use MKL].
   
   ### Workaround [slow]
   ```
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f0b79ff8c28,0x7f0b62cfb040,2048,0x7f0b6800b740,2048,0x7f0b79ff8c30,0x7f0b68074e80,512) 119.93us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.142822
   ```
   ```
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0925}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0921}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1441}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_forward_FullyConnected': 0.1688}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1773}]}]
   ```
   ### Default [faster]
   ```
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11499
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
   ```
   ```
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0691}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0669}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1166}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048},  'avg_time_forward_FullyConnected': 0.1438}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1509}]}]
   ```
   
   ## Python's time-it function in OpPerf
   Also, the pattern holds when I use Python's timeit function in OpPerf instead of MXNet's native [built-in] profiler:
   ### Workaround [slow]
   ```
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f86067f9c28,0x7f85e34fd040,2048,0x7f85f000b740,2048,0x7f86067f9c30,0x7f85f005fc80,512) 120.56us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.143066
   ```
   ```
   [{'FullyConnected': [
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.18504947423934937, 'p50_time_FullyConnected': 0.1817569718696177, 'p90_time_FullyConnected': 0.19327133195474744, 'p99_time_FullyConnected': 0.21545797004364425}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.1922117848880589, 'p50_time_FullyConnected': 0.18283800454810262, 'p90_time_FullyConnected': 0.19954713061451912, 'p99_time_FullyConnected': 0.36291924072429527}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.24743830785155296, 'p50_time_FullyConnected': 0.23927190341055393, 'p90_time_FullyConnected': 0.25965459644794464, 'p99_time_FullyConnected': 0.3613424429204313}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2728361077606678, 'p50_time_FullyConnected': 0.2635499695315957, 'p90_time_FullyConnected': 0.2853184472769499, 'p99_time_FullyConnected': 0.39780672872439016}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.2834270987659693, 'p50_time_FullyConnected': 0.27543300529941916, 'p90_time_FullyConnected': 0.29397032922133803, 'p99_time_FullyConnected': 0.40460130083374674}]}]
   ```
   ### Default [faster]
   ```
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.116943
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
   ```
   ```
   [{'FullyConnected': [
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.17584662651643157, 'p50_time_FullyConnected': 0.17386197578161955, 'p90_time_FullyConnected': 0.18297710921615362, 'p99_time_FullyConnected': 0.2026369632221758}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.177996635902673, 'p50_time_FullyConnected': 0.17414247849956155, 'p90_time_FullyConnected': 0.18130375538021326, 'p99_time_FullyConnected': 0.23936283076181997}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.22948215948417783, 'p50_time_FullyConnected': 0.22402300965040922, 'p90_time_FullyConnected': 0.2324321656487882, 'p99_time_FullyConnected': 0.3020400775130837}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2527528256177902, 'p50_time_FullyConnected': 0.24718954227864742, 'p90_time_FullyConnected': 0.263673288282007, 'p99_time_FullyConnected': 0.3303098364267497}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.26115449611097574, 'p50_time_FullyConnected': 0.25852309772744775, 'p90_time_FullyConnected': 0.26592351496219635, 'p99_time_FullyConnected': 0.304172352189198}]}]
   ```
   
   Since Skylake has AVX512, I'm guessing this FC slowdown might be in the AVX512 kernel. Let me reproduce that.
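
   For anyone reproducing this, the two timing modes above correspond to opperf's `profiler` switch; an invocation along these lines should work (a sketch from memory of the opperf utilities, so treat the import path and signature as assumptions):
   ```python
   import mxnet as mx
   from benchmark.opperf.utils.benchmark_utils import run_performance_test

   # profiler='native' uses MXNet's built-in profiler; 'python' uses Python-level timing.
   results = run_performance_test(
       mx.nd.FullyConnected, run_backward=False, dtype='float32', ctx=mx.cpu(),
       inputs=[{"data": (5, 2048), "weight": (512, 2048),
                "no_bias": True, "num_hidden": 512}],
       warmup=25, runs=100, profiler='python')
   print(results)
   ```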


[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-637751126


   @aaraujom any update?


[GitHub] [incubator-mxnet] TaoLv commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
TaoLv commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-676507893


   @ChaiBapchya Is it possible to compare the performance of FC in the two versions? Comparing FC with dot is not so fair to me. As you may know, they use different execution paths: FCompute vs. FComputeEx.


[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-675602405


   Is there a way to bring down the deficit and reach parity? Or do we know the reason why DNNL is slower? @TaoLv @aaraujom 


[GitHub] [incubator-mxnet] ChaiBapchya edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya edited a comment on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629385677


   > Tested with MXNet [cfb474b](https://github.com/apache/incubator-mxnet/commit/cfb474ba743d5ea85161bf19875488f4cb409d3c). Compiled with mostly-default cmake settings:
   > 
   > ```shell
   > cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   > ```
   > 
   > Then when I run
   > 
   > ```
   > export MKL_VERBOSE=1
   > export MKLDNN_VERBOSE=1
   > python3
   > Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
   > [GCC 8.3.0] on linux
   > Type "help", "copyright", "credits" or "license" for more information.
   > >>> import mxnet as mx
   > Numpy + Intel(R) MKL: THREADING LAYER: (null)
   > Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
   > Numpy + Intel(R) MKL: preloading libiomp5.so runtime
   > ```
   
   @kpuatamazon @kpu
   Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with the default cmake config doesn't use MKL as BLAS.
   Hence the MKL_VERBOSE output shown above never appears.
   
   Thus, for the Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel and update the cmake command to
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   I found this uses MKL as BLAS, and `export MKL_VERBOSE=1` confirms it.
   
   With this addition to both builds [default & workaround] I reran opperf and didn't see much performance difference.
   
   ## Commands
   Default
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   
   Workaround
   ```
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   
   ## Logs
   #### Default
   Batch_dot
   ```
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.71ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.53ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
   ```
   FC
   ```
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0551758
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0559082
   ```
   
   #### Workaround
   Batch_dot [same as default]
   ```
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.72ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.77ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
   ```
   FC [additional MKL_VERBOSE before DNNL_VERBOSE]
   ```
   MKL_VERBOSE SGEMM(T,N,512,4,512,0x7f985b789c28,0x7f97b5e52e80,512,0x7f97b401e600,512,0x7f985b789c30,0x7f976e89dd80,512) 39.68us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0769043
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f985b789c28,0x7f976c400100,2048,0x7f976e887100,2048,0x7f985b789c30,0x7f976e8da000,512) 79.41us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11377
   ```
   ## Results
   | Operator  | LHS            | RHS            | MKL Default (ms) | MKL Workaround (ms) |
   |-----------|----------------|----------------|------------------|---------------------|
   | Dot       | (4, 512, 512)  | (4, 512, 512)  | 4.1112           | 4.8241              |
   |           | (5, 512, 512)  | (5, 512, 512)  | 6.4421           | 7.607               |
   |           | (5, 512, 1536) | (5, 512, 1536) | 20.3648          | 19.2217             |
   |           | (5, 512, 2048) | (5, 512, 2048) | 23.3236          | 23.2849             |
   |           | (5, 2048, 512) | (5, 2048, 512) | 123.1235         | 123.9806            |
   | Batch_dot | (4, 512, 512)  | (4, 512, 512)  | 1.4105           | 1.407               |
   |           | (5, 512, 512)  | (5, 512, 512)  | 1.7558           | 1.7511              |
   |           | (5, 512, 1536) | (5, 512, 1536) | 6.5931           | 6.5585              |
   |           | (5, 512, 2048) | (5, 512, 2048) | 9.1452           | 9.1031              |
   |           | (5, 2048, 512) | (5, 2048, 512) | 29.0192          | 28.9236             |
   
   | Operator | Data      | Weight      | MKL Default (ms) | MKL Workaround (ms) |
   |----------|-----------|-------------|------------------|---------------------|
   | FC       | (4, 512)  | (512, 512)  | 0.057            | 0.0685              |
   |          | (5, 512)  | (512, 512)  | 0.0591           | 0.0698              |
   |          | (5, 512)  | (1536, 512) | 0.0823           | 0.0939              |
   |          | (5, 512)  | (2048, 512) | 0.0916           | 0.1026              |
   |          | (5, 2048) | (512, 2048) | 0.1146           | 0.1267              |


[GitHub] [incubator-mxnet] pengzhao-intel closed issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
pengzhao-intel closed issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980


   


[GitHub] [incubator-mxnet] ChaiBapchya edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
ChaiBapchya edited a comment on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-623803913


   > In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for `LD_PRELOAD`. Just do this before running cmake the first time:
   > 
   > ```shell
   > export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   > ```
   > 
   > Then `cmake` can be run normally:
   > 
   > ```shell
   > cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   > ```
   > 
   > and the compiled MXNet can be run normally without any special environment variables.
   
   @kpuatamazon Hi, I was trying to benchmark MKL [default] vs. the workaround using opperf.
   Despite ensuring MKL is installed and using the `export CXXFLAGS` followed by the usual cmake command, the build failed with 
   ```
   gemm.cpp:(.text+0xb45): undefined reference to `cblas_gemm_s8u8s32'
   ```
   
   I tried the undocumented abominable kludge option you mentioned and that worked smoothly.
   ```
   export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
   rm -rf build/
   mkdir -p build && cd build
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
   cmake --build . --parallel 1024
   ```
   
   Script for OpPerf : https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad
   
   Results
   | Operator  | LHS            | RHS            | MKL Default (ms) | MKL Workaround (ms) |
   |-----------|----------------|----------------|------------------|---------------------|
   | Dot       | (4, 512, 512)  | (4, 512, 512)  | 15.1122          | 4.1254              |
   |           | (5, 512, 512)  | (5, 512, 512)  | 38.1678          | 7.5323              |
   |           | (5, 512, 1536) | (5, 512, 1536) | 21.6601          | 19.2503             |
   |           | (5, 512, 2048) | (5, 512, 2048) | 29.0369          | 23.7432             |
   |           | (5, 2048, 512) | (5, 2048, 512) | 167.5528         | 129.9957            |
   | Batch_dot | (4, 512, 512)  | (4, 512, 512)  | 1.7898           | 1.5445              |
   |           | (5, 512, 512)  | (5, 512, 512)  | 2.2457           | 1.9361              |
   |           | (5, 512, 1536) | (5, 512, 1536) | 6.1453           | 5.4034              |
   |           | (5, 512, 2048) | (5, 512, 2048) | 8.246            | 8.0442              |
   |           | (5, 2048, 512) | (5, 2048, 512) | 160.6243         | 29.0772             |
   |           | **Data**       | **Weight**     |                  |                     |
   | FC        | (4, 512)       | (512, 512)     | 0.0609           | 0.068               |
   |           | (5, 512)       | (512, 512)     | 0.0633           | 0.0731              |
   |           | (5, 512)       | (1536, 512)    | 0.0916           | 0.0996              |
   |           | (5, 512)       | (2048, 512)    | 0.1081           | 0.1084              |
   
   However, @kpuatamazon, when I try to go back to the default [i.e. default -> workaround -> default] by unsetting the LD_PRELOAD environment variable, the default build fails with `gemm.cpp:(.text+0xe6b): undefined reference to 'cblas_gemm_s8u8s32'`
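
   One plausible cause (a guess on my part, not confirmed in this thread): the `-DUSE_MKL` left in `CXXFLAGS` still compiles the MKL-only code path (`cblas_gemm_s8u8s32` is an MKL extension to CBLAS) while nothing links MKL in anymore. Going back to the default likely needs the flag cleared and a fresh build directory as well:
   ```bash
   unset LD_PRELOAD
   export CXXFLAGS=""    # drop the leftover -DUSE_MKL from the workaround build
   rm -rf build && mkdir -p build && cd build
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ```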


[GitHub] [incubator-mxnet] kpu commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
kpu commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629400657


   `dot` hasn't been taken over by MKLDNN = DNNL = oneAPI yet, so you shouldn't expect to see a difference there because it never calls MKLDNN's GEMM anyway.  FullyConnected has been taken over by MKLDNN = DNNL = oneAPI.  This inconsistency is part of the bug report.  Your logs agree with these statements.  
   
   Do you really mean LHS (4, 512, 512)?  I'm talking about LHS (4,512) RHS (512, 512).  
   
   What CPU is this on?  Note that mine were on a Skylake Xeon `c9.9xlarge`.  I'd expect the AVX512 kernels to be pretty different from the AVX2 kernels.  
   
   What OMP threading were you using?  I'd expect for matrices this small that thread scaling is terrible for anything but a tiny number of threads.  (I mentioned using 4).  
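
   For an apples-to-apples rerun, it may help to pin these explicitly for both builds (the same environment variables used earlier in this thread; the script name is a placeholder):
   ```bash
   export OMP_NUM_THREADS=4    # matrices this small scale poorly past a few threads
   export MKL_VERBOSE=1        # log each MKL call with its timing
   export MKLDNN_VERBOSE=1     # log each DNNL primitive with its timing
   python3 your_benchmark.py   # placeholder for whichever benchmark script is used
   ```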


[GitHub] [incubator-mxnet] anko-intel commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

Posted by GitBox <gi...@apache.org>.
anko-intel commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-703432715


   @ChaiBapchya, you are right; we also have to include 1.6.4 (https://github.com/apache/incubator-mxnet/pull/19251) to make it happen.

