Posted to dev@mxnet.apache.org by Chris Olivier <cj...@gmail.com> on 2019/06/18 16:24:11 UTC

OMP

I am very reluctant to feed the trolls again, and this will be the last
time I address Pedro or Anton on the subject, but since I think the numbers
being presented are incorrect (either because the builders don't really
understand what they are building, or possibly through intentional
misdirection):

Turning Intel OMP on and off (and MKL as well, since it tends to pull in
OMP, depending on which one is linked in) makes a HUGE difference.  This is
consistent with my experience from before, when it was added.


default mnist:

python ../example/image-classification/train_mnist.py
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
gpus=None, image_shape='1, 28, 28', initializer='default',
kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
monitor=0, network='mlp', num_classes=10, num_epochs=20,
num_examples=60000, num_layers=None, optimizer='sgd',
profile_server_suffix='', profile_worker_suffix='', save_period=1,
test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)

INTEL OMP:

ldd libmxnet.so | grep omp
        libomp.so =>
/home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
(0x00007f978fde7000)

INFO:root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
accuracy=0.780012
INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21 samples/sec
accuracy=0.920469
INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91 samples/sec
accuracy=0.928281
INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36 samples/sec
accuracy=0.942813
INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79 samples/sec
accuracy=0.938750
INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52 samples/sec
accuracy=0.946562
INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41 samples/sec
accuracy=0.953281
INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73 samples/sec
accuracy=0.951562
INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88 samples/sec
accuracy=0.957500
INFO:root:Epoch[0] Train-accuracy=0.925423
INFO:root:Epoch[0] Time cost=3.806
INFO:root:Epoch[0] Validation-accuracy=0.962580
INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21 samples/sec
accuracy=0.968131
INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03 samples/sec
accuracy=0.966250


LIBGOMP:

ldd libmxnet.so | grep omp
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x00007f25c25dd000)

INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01 samples/sec
 accuracy=0.782488
INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32 samples/sec
 accuracy=0.907813
INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00 samples/sec
 accuracy=0.927188
INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45 samples/sec
 accuracy=0.937969
INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95 samples/sec
 accuracy=0.942187
INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58 samples/sec
 accuracy=0.950156
INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16 samples/sec
 accuracy=0.947969
INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07 samples/sec
 accuracy=0.953750
INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58 samples/sec
accuracy=0.953125

That being said, there are other issues beyond speed.  The DEFAULT build
from the Makefile (not CMake) uses Intel OMP with MKL (as I showed before)
and mysteriously has no issues?  This seems highly suspicious.  All I see
is a lot of hand-waving and conjecture and pointing to StackOverflow posts
made by people who may be of questionable pedigree to begin with.  This
smells of a Pedro-ego-fight rather than a dispute of purely technical
merit.  Also, anyone who knows how OMP works would be very suspicious of
the "intermittent hangs" claim -- that's probably just broken race
conditions elsewhere until proven otherwise.  It would tend to freeze on
the first use if something were wrong (try using libgomp after a fork and
see), since worker threads wouldn't be assigned/joined properly.  Intel OMP
is faster, but also has other advantages, such as allowing OMP after a fork.
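
For anyone who wants to try that fork experiment, a minimal standalone
sketch (not MXNet code) would look something like this; built with
g++ -fopenmp it links libgomp, built with clang++ -fopenmp it links
LLVM's libomp:

// fork_omp.cc -- minimal standalone sketch, not MXNet code.
//   g++ -fopenmp fork_omp.cc -o fork_omp      (links libgomp)
//   clang++ -fopenmp fork_omp.cc -o fork_omp  (links LLVM libomp)
#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Touch the runtime so the worker pool is created in the parent.
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("parent team: %d threads\n", omp_get_num_threads());
  }

  pid_t pid = fork();
  if (pid == 0) {
    // Child: the parent's OMP worker threads do not exist here.
    // With libgomp this parallel region is where a hang would show up;
    // LLVM/Intel OMP re-creates its pool after fork and proceeds.
    #pragma omp parallel
    {
      #pragma omp single
      std::printf("child team: %d threads\n", omp_get_num_threads());
    }
    _exit(0);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  return 0;
}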

I actually addressed a lot of issues and asked for clarification in the
original PRs way back when, but they were all just ignored.

-Chris

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
1) I don't think the problem is related to environment variables, but
it is good to know that you agree we can remove the modification to
OMP_NUM_THREADS, which was creating random crashes with low
probability.
For the answer to your question I suggest using a debugger and
reasoning about initialization, static construction and thread
creation via pthread_atfork. I think we are mixing all of those
difficult and subtle actions with side effects, and we have the
perfect storm.
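
As an aside on the OMP_NUM_THREADS point: the OpenMP runtime reads that
variable only when it initializes its internal control variables, so
changing it later from inside the process has no effect, while
omp_set_num_threads() does. A minimal standalone sketch (not MXNet code)
to convince yourself:

// omp_env_vs_api.cc -- standalone sketch, not MXNet code.
#include <omp.h>
#include <cstdio>
#include <cstdlib>

static void report(const char* label) {
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("%s: %d threads\n", label, omp_get_num_threads());
  }
}

int main() {
  report("initial");                    // team size taken from OMP_NUM_THREADS at init

  setenv("OMP_NUM_THREADS", "1", 1);    // too late: the runtime already read its ICVs
  report("after setenv");               // unchanged

  omp_set_num_threads(2);               // the runtime API does take effect
  report("after omp_set_num_threads");  // 2 threads
  return 0;
}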

I think one of the possibilities is that two threads can initialize
OpenMP at the same time, or at least that if OMP is being initialized
while the operator tuning code is running, you end up inside the
OpenMP code before it has been initialized or while initialization is
still in progress. __kmp_team_pool and other volatile variables used
inside the OMP runtime get changed by different threads. I verified
this myself by modifying OpenMP and putting a write watch on that
memory region: they were indeed changing value during
__kmp_do_serial_initialize, producing the assert described in this
issue: https://github.com/apache/incubator-mxnet/issues/10856

Read below for a more detailed explanation of this process with
backtraces included.

Kind regards.


====

omp_get_num_procs and other openmp functions are called from different
places concurrently, such as operator tuning and static initialization
of openmp here:

https://github.com/apache/incubator-mxnet/blob/master/src/engine/openmp.cc#L37

While static initialization itself is thread safe, the constructor of a
statically initialized object might not be.

Operator tuning is called during (static) library initialization.
Static initialization is, per the C++ standard, done in an
implementation-defined order.
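
To make that point concrete, here is a reduced sketch. OpenMPConfig and
OperatorTuner are made-up stand-ins that only mirror the shape of
openmp.cc and operator_tune.cc; in the real library the two objects live
in different translation units loaded via dlopen, so nothing orders
their constructors, and the fork handlers can run concurrently with them:

// static_init_order.cc -- illustrative sketch only.  OpenMPConfig and
// OperatorTuner are hypothetical stand-ins for the namespace-scope
// objects constructed in src/engine/openmp.cc and operator_tune.cc.
#include <omp.h>
#include <cstdio>

// "Translation unit A": the constructor queries the runtime; this call
// can be the first touch that triggers __kmp_do_serial_initialize.
struct OpenMPConfig {
  OpenMPConfig() : procs(omp_get_num_procs()) {}
  int procs;
};
static OpenMPConfig g_omp_config;

// "Translation unit B": the constructor runs parallel regions (operator
// tuning); this can also be the first touch of the runtime.  The C++
// standard does not order the two constructors across translation units,
// and a pthread_atfork handler on another thread can enter the runtime
// at the same time.
struct OperatorTuner {
  OperatorTuner() {
    #pragma omp parallel
    { (void)omp_get_thread_num(); }
  }
};
static OperatorTuner g_tuner;

int main() {
  std::printf("procs=%d\n", g_omp_config.procs);
  return 0;
}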

2) Is what I described above correct, or could it cause some problems?
At the very least it is causing an assertion inside OpenMP, so it seems
to violate an invariant that the OpenMP developers consider to hold;
this is a bit of a concern for me.

This explains the assertion inside OpenMP. Let me know if you have any
other questions or if you think this is incorrect. If I'm not mistaken
you contributed some parts of this code.

3) Why are the pip packages using libgomp while builds from source use
LLVM OpenMP? (I asked this question on the 1.5 release thread.)

Below are some stack traces, captured with a debugger, that support the
observations and reasoning above.

__kmp_allocate_thread kmp_runtime.cpp:4153
__kmp_allocate_team kmp_runtime.cpp:4965
__kmp_fork_call kmp_runtime.cpp:1991
__kmp_GOMP_fork_call kmp_gsupport.cpp:290
__kmp_api_GOMP_parallel kmp_gsupport.cpp:1080
mxnet::op::OperatorTune<float>::GetOMPLoopOverhead operator_tune-inl.h:342
mxnet::op::OperatorTune<float>::GetOMPLoopOverhead operator_tune-inl.h:370
mxnet::op::OperatorTune<float>::Initialize operator_tune-inl.h:174
mxnet::op::OperatorTune<float>::TuneAll operator_tune-inl.h:220
mxnet::op::OperatorTune<float>::OperatorTune operator_tune-inl.h:116
mxnet::op::UnaryOpTune<float>::UnaryOpTune operator_tune-inl.h:534
mxnet::op::BinaryOpTune<float>::BinaryOpTune operator_tune-inl.h:724
__static_initialization_and_destruction_0 operator_tune.cc:369
_GLOBAL__sub_I_operator_tune.cc(void) operator_tune.cc:378
call_init 0x00007f8f4e41d733
_dl_init 0x00007f8f4e41d733
dl_open_worker 0x00007f8f4e4221ff
__GI__dl_catch_exception 0x00007f8f4e1832df
_dl_open 0x00007f8f4e4217ca
dlopen_doit 0x00007f8f4dbf9f96
__GI__dl_catch_exception 0x00007f8f4e1832df
__GI__dl_catch_error 0x00007f8f4e18336f
_dlerror_run 0x00007f8f4dbfa735
__dlopen 0x00007f8f4dbfa051
<unknown> 0x00007f8f4b3eacda
<unknown> 0x0000000000502d6f
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
_PyFunction_FastCallDict 0x0000000000501ba7
<unknown> 0x0000000000591461
<unknown> 0x000000000054b813
<unknown> 0x0000000000555421
_PyObject_FastCallKeywords 0x00000000005a730c
<unknown> 0x0000000000503073
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
PyEval_EvalCode 0x0000000000506393
<unknown> 0x0000000000634d52
PyRun_FileExFlags 0x0000000000634e0a
PyRun_SimpleFileExFlags 0x00000000006385c8
Py_Main 0x000000000063915a
main 0x00000000004a6f10
__libc_start_main 0x00007f8f4e03db97
_start 0x00000000005afa0a


(py3_venv) piotr@panther:0: ~/d/mxnet [master]> nosetests -v -s
tests/python/unittest/test_gluon.py 2>&1 | head
kmp __kmp_do_serial_initialize: kmp_team_pool: 0
kmp __kmp_do_serial_initialize xx: kmp_team_pool: 0
kmp __kmp_do_serial_initialize: kmp_team_pool: 1
Assertion failure at kmp_runtime.cpp(6488): __kmp_team_pool == __null.




Due to the pthread_atfork handlers, we are incurring a significant
overhead when forking new processes. I think we are destroying and
re-creating the engine thread pool on EVERY fork, as shown in this
backtrace (see also the standalone pthread_atfork sketch at the end of
this message):

Thread-1-[python]:

 __GI___pthread_timedjoin_ex 0x00007f9d7708bd2d

std::thread::join() 0x00007f9d199428c3
mxnet::engine::ThreadPool::~ThreadPool thread_pool.h:84
std::default_delete<mxnet::engine::ThreadPool>::operator() unique_ptr.h:78
std::unique_ptr<mxnet::engine::ThreadPool,
std::default_delete<mxnet::engine::ThreadPool> >::~unique_ptr
unique_ptr.h:268
mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>::~ThreadWorkerBlock
threaded_engine_perdevice.cc:212
std::default_delete<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>
>::operator() unique_ptr.h:78
std::unique_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>,
std::default_delete<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>
> >::reset unique_ptr.h:376
mxnet::engine::ThreadedEnginePerDevice::StopNoWait
threaded_engine_perdevice.cc:67
mxnet::engine::ThreadedEnginePerDevice::Stop threaded_engine_perdevice.cc:73
mxnet::LibraryInitializer::LibraryInitializer()::{lambda()#1}::operator()()
const initialize.cc:61
mxnet::LibraryInitializer::LibraryInitializer()::{lambda()#1}::_FUN()
initialize.cc:62
__libc_fork 0x00007f9d77386aca
<unknown> 0x00000000005e8646
<unknown> 0x0000000000502d6f
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
_PyFunction_FastCallDict 0x0000000000501b2e
<unknown> 0x0000000000591461
<unknown> 0x000000000054b813
<unknown> 0x0000000000555421
_PyObject_FastCallKeywords 0x00000000005a730c
<unknown> 0x0000000000503073
_PyEval_EvalFrameDefault 0x0000000000507641
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
<unknown> 0x0000000000510f36
<unknown> 0x00000000005030d5
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
<unknown> 0x0000000000510dc4
PyCFunction_Call 0x0000000000566103
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
_PyFunction_FastCallDict 0x0000000000501b2e
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee790
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
<unknown> 0x0000000000510dc4
PyCFunction_Call 0x0000000000566103
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
_PyFunction_FastCallDict 0x0000000000501b2e
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee790
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
<unknown> 0x0000000000510dc4
PyCFunction_Call 0x0000000000566103
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
_PyFunction_FastCallDict 0x0000000000501b2e
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee790
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
PyEval_EvalCode 0x0000000000506393
<unknown> 0x0000000000634d52
PyRun_FileExFlags 0x0000000000634e0a
PyRun_SimpleFileExFlags 0x00000000006385c8
Py_Main 0x000000000063915a
main 0x00000000004a6f10
__libc_start_main 0x00007f9d772c3b97
_start 0x00000000005afa0a



At the same time, another thread is also inside the omp runtime:
Thread-82:

__kmp_free_team kmp_runtime.cpp:5331
__kmp_reset_root kmp_runtime.cpp:3850
__kmp_unregister_root_current_thread kmp_runtime.cpp:3942
__kmp_internal_end_thread kmp_runtime.cpp:6057
__kmp_internal_end_dest kmp_runtime.cpp:5620
__nptl_deallocate_tsd 0x00007f9d77089408
__nptl_deallocate_tsd 0x00007f9d7708a81b
start_thread 0x00007f9d7708a81b
clone 0x00007f9d773c388f


==================== Different timepoint


__kmp_do_serial_initialize kmp_runtime.cpp:6485
__kmp_do_middle_initialize kmp_runtime.cpp:6597
__kmp_middle_initialize kmp_runtime.cpp:6706
__kmp_api_omp_get_num_procs kmp_ftn_entry.h:405
mxnet::engine::OpenMP::OpenMP openmp.cc:49
mxnet::engine::OpenMP::Get openmp.cc:37
__static_initialization_and_destruction_0 openmp.cc:110
_GLOBAL__sub_I_openmp.cc(void) openmp.cc:113
call_init 0x00007f439c83a733
_dl_init 0x00007f439c83a733
dl_open_worker 0x00007f439c83f1ff
__GI__dl_catch_exception 0x00007f439c5a02df
_dl_open 0x00007f439c83e7ca
dlopen_doit 0x00007f439c016f96
__GI__dl_catch_exception 0x00007f439c5a02df
__GI__dl_catch_error 0x00007f439c5a036f
_dlerror_run 0x00007f439c017735
__dlopen 0x00007f439c017051
<unknown> 0x00007f4399807cda
<unknown> 0x0000000000502d6f
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
_PyFunction_FastCallDict 0x0000000000501ba7
<unknown> 0x0000000000591461
<unknown> 0x000000000054b813
<unknown> 0x0000000000555421
_PyObject_FastCallKeywords 0x00000000005a730c
<unknown> 0x0000000000503073
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
<unknown> 0x0000000000511d78
PyCFunction_Call 0x000000000056617e
_PyEval_EvalFrameDefault 0x000000000050bb66
<unknown> 0x0000000000504c28
<unknown> 0x0000000000502540
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
<unknown> 0x0000000000502209
<unknown> 0x0000000000502f3d
_PyEval_EvalFrameDefault 0x0000000000506859
_PyFunction_FastCallDict 0x0000000000501945
_PyObject_FastCallDict 0x00000000005a36f1
_PyObject_CallMethodIdObjArgs 0x000000000059662e
PyImport_ImportModuleLevelObject 0x00000000004ee84d
_PyEval_EvalFrameDefault 0x000000000050896c
<unknown> 0x0000000000504c28
PyEval_EvalCode 0x0000000000506393
<unknown> 0x0000000000634d52
PyRun_FileExFlags 0x0000000000634e0a
PyRun_SimpleFileExFlags 0x00000000006385c8
Py_Main 0x000000000063915a
main 0x00000000004a6f10
__libc_start_main 0x00007f439c45ab97
_start 0x00000000005afa0a
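
For completeness, here is the pthread_atfork mechanism behind the first
pair of traces, reduced to a standalone sketch. EnginePool and the
handler names are made up; in MXNet the real handlers registered in
src/initialize.cc stop and restart the engine:

// atfork_pool.cc -- standalone sketch of the pthread_atfork pattern above.
// EnginePool and the handler names are made up; the real handlers are
// registered in src/initialize.cc and stop/restart the MXNet engine.
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <thread>
#include <vector>

struct EnginePool {
  std::vector<std::thread> workers;
  void start(int n) {
    for (int i = 0; i < n; ++i)
      workers.emplace_back([] { /* worker loop would live here */ });
  }
  void stop() {
    for (auto& t : workers) t.join();
    workers.clear();
  }
};
static EnginePool g_pool;

// These callbacks run around EVERY fork() in the process, which is where
// the stop/join/restart overhead in the Thread-1 backtrace comes from.
static void on_prepare() { g_pool.stop(); }    // parent, before fork
static void on_parent()  { g_pool.start(4); }  // parent, after fork
static void on_child()   { g_pool.start(4); }  // child, after fork
// If on_child() (or a static initializer still running on another thread)
// enters an OpenMP region while the OpenMP runtime is initializing, the
// runtime is re-entered -- the situation the __kmp_team_pool assert shows.

int main() {
  pthread_atfork(on_prepare, on_parent, on_child);
  g_pool.start(4);

  pid_t pid = fork();        // pays the stop + restart cost on every fork
  if (pid == 0) _exit(0);
  int status = 0;
  waitpid(pid, &status, 0);
  g_pool.stop();
  return 0;
}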


On Tue, Jun 25, 2019 at 1:55 PM Chris Olivier <cj...@gmail.com> wrote:
>
> 1) I don't see how that code could cause reentrancy problems in omp. It
> doesn't make any OMP calls at all.  Still doesn't look related to me.
> Setting an environment variable probably doesn't even do anything, because:
>   a) It probably doesn't check the environment variable except at initial
> startup
>   b) Even if it did, whether this code ran before or after the OMP init
> code would be nondeterministic
>   c) It for sure doesn't check the environment variable every time it hits
> an omp region.  That would be ridiculously expensive and checking the OMP
> source code, it doesn't..  You can't affect the OMP behavior at arbitrary
> points in time by setting the "OMP_NUM_THREADS" environment variable.
>
>
>
>
> On Tue, Jun 25, 2019 at 1:20 PM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > Nobody claimed that the original lockup has to do with OMP, but the
> > fix caused re-entrancy into OMP initialization as explained below. So
> > I agree with your statement that the bug that using pthread_atfork was
> > fixing is not related with OMP, but the fix is causing interactions
> > with OMP as described above.
> >
> > Pedro.
> >
> > On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier <cj...@gmail.com>
> > wrote:
> > >
> > > The call stacks there are mostly associated with the execution engine
> > > threads, which are not OMP threads.  That lockup doesn't look to me to be
> > > related to OMP   -- the execution engine uses its own thread pool logic
> > --
> > > I'm pretty familiar with that part of the code.  Unless I am missing one
> > --
> > > can you point to the one that looks OMP-related?
> > >
> > >
> > > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > Thanks for digging that out Kellen. That's good info so maybe it would
> > > > be good to rework the fix with the info you provided and remove the
> > > > pthread_atfork handlers.
> > > > Do you think setting the device would avoid the problem seen on the
> > > > backtrace you provided?  specifically here:
> > > >
> > > >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
> > > >
> > > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> > > > <ke...@gmail.com> wrote:
> > > > >
> > > > > I remember at the time we also had a read through of this blog post,
> > but
> > > > to
> > > > > use the code looked like it was following the advice:
> > > > >
> > > >
> > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> > > > >
> > > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > > > > kellen.sunderland@gmail.com> wrote:
> > > > >
> > > > > > I remember this hang as well, it was pretty hard to reproduce
> > IIRC.  I
> > > > > > believe the stacks for the hang are here:
> > > > > >
> > > >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> > > > and
> > > > > > the trick was we could only debug it up to the point that we hit:
> > > > > >
> > > > > > #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> > > > > > futex_word=0x7fec60843758)
> > > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > > > > #1  futex_wait_simple (private=0, expected=1,
> > > > futex_word=0x7fec60843758)
> > > > > >     at ../sysdeps/nptl/futex-internal.h:135
> > > > > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > > > > init_routine=0x7fec605f38f0)
> > > > > >     at pthread_once.c:105
> > > > > > ...
> > > > > > #6  0x00007fec6061c577 in cudaSetDevice () from
> > > > > > /usr/local/cuda/lib64/libcudart.so.9.0
> > > > > >
> > > > > > because the code in libcudart is obviously closed source we
> > couldn't
> > > > dig
> > > > > > into what threading work was going on when we called cudaSetDevice.
> > > > > >
> > > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> If you check initialize.cc we seem to be explicitly disabling that
> > > > > >> behaviour in pthread_at_fork which seems to cause thread
> > contention
> > > > > >> during multiprocessing. Why do we need this major advantage for
> > the
> > > > > >> library if that's the case?
> > > > > >>
> > > > > >> Related PRs:
> > > > > >>
> > > > > >> https://github.com/apache/incubator-mxnet/pull/10820
> > > > > >> https://github.com/apache/incubator-mxnet/issues/14396
> > > > > >>
> > > > > >> The original code was authored in this PR:
> > > > > >>
> > > > > >> https://github.com/apache/incubator-mxnet/pull/8677
> > > > > >>
> > > > > >> I actually remember this fix, it was done during a release as the
> > cuda
> > > > > >> runtime was forking and the engine was being re-entered. If that
> > > > > >> situation is not happening anymore it might not be needed any
> > longer.
> > > > > >> I don't think we know the cause why there was a fork inside cuda,
> > so
> > > > > >> the code has grown around a fix for an issue which its root cause
> > was
> > > > > >> not understood, and side effects which this fix caused afterwards.
> > > > > >>
> > > > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container
> > provided in
> > > > > >> the link above, no libgomp.
> > > > > >>
> > > > > >> I didn't try the Make build.
> > > > > >>
> > > > > >> I would refactor the code linked above and stop using
> > pthread_at_fork,
> > > > > >> since OMP assumes it won't be initialized twice, but needs to be
> > very
> > > > > >> well tested to make sure it doesn't cause bugs or affect the fixes
> > > > > >> done on the linked PRs above.
> > > > > >>
> > > > > >> Pedro.
> > > > > >>
> > > > > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <
> > cjolivier01@gmail.com>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > one major advantage of intel/llvm omp is that it spawns a new
> > thread
> > > > > >> pool
> > > > > >> > after fork if a thread pool was already created. this is so
> > that omp
> > > > > >> can be
> > > > > >> > used in the forked processes. libgomp doesn’t do this so it’ll
> > just
> > > > > >> lock up
> > > > > >> > if you try to do omp in the forked process.
> > > > > >> >
> > > > > >> > is your build linking libgomp as well?
> > > > > >> >
> > > > > >> > standard mkl build (from Makefile) uses same omp library. are
> > there
> > > > > >> > problems with that build?
> > > > > >> >
> > > > > >> > what changes need to be made to make the assertion not fire?
> > > > > >> >
> > > > > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> > > > > >> pedro.larroy.lists@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > There's an assertion which is easily reproducible, and also
> > > > there's a
> > > > > >> > > crash including core dump, the latter is not easy to reproduce
> > > > for me
> > > > > >> > > in different environments. I have also seen mxnet getting
> > stuck
> > > > > >> > > without progressing with this build configuration and using no
> > > > CPU at
> > > > > >> > > all when running unit tests.
> > > > > >> > >
> > > > > >> > > In my view, the root cause of the assertion is that we are
> > > > re-entering
> > > > > >> > > OMP initialization when spawning threads on the following code
> > > > through
> > > > > >> > > pthread_at_fork
> > > > > >> > >
> > > > > >> > >
> > > > > >>
> > > >
> > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > > > > >> > >
> > > > > >> > > This causes double initialization of the OMP engine,
> > including the
> > > > > >> > > assertion which you are asking about,  and I suspect some
> > > > additional
> > > > > >> > > overhead. That's the shady forking part you are asking for.
> > > > > >> > >
> > > > > >> > > A question for you: What is the cause of runtime differences
> > > > between
> > > > > >> > > OMP runtimes? Shouldn't the implementation overhead diminish
> > as
> > > > > >> > > threads run longer?
> > > > > >> > >
> > > > > >> > > Pedro.
> > > > > >> > >
> > > > > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <
> > > > cjolivier01@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > What’s the reason for the assertion failure? btw
> > classifying an
> > > > > >> assertion
> > > > > >> > > > failure a “crash” is debatable. As I stated in the original
> > > > issue a
> > > > > >> long
> > > > > >> > > > time ago, it’s possible something shady is being done with
> > when
> > > > > >> forking
> > > > > >> > > > that should be fixed.  The assertion should be root caused.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > > > > >> > > pedro.larroy.lists@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Added a dockerfile, and reports of a crash in my local
> > machine
> > > > > >> when
> > > > > >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash
> > happened
> > > > as
> > > > > >> well.
> > > > > >> > > > > I couldn't reproduce the crash on my EC2 machine:
> > > > > >> > > > > Added the backtrace of the crash as well.
> > > > > >> > > > >
> > > > > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > > > >> > > > >
> > > > > >> > > > > Dockerfile here:
> > > > > >> > > > >
> > > > > >> > > > > https://github.com/larroy/mxnet_omp
> > > > > >> > > > >
> > > > > >> > > > > Kind regards.
> > > > > >> > > > >
> > > > > >> > > > > Pedro.
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > > > > >> > > marco.g.abreu@gmail.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > As already proposed, I think the easiest way to get a
> > common
> > > > > >> > > > > understanding
> > > > > >> > > > > > is if we start with a few docker containers. Pedro,
> > would
> > > > it be
> > > > > >> > > possible
> > > > > >> > > > > > for you to wrap your benchmarks into a few containers
> > that
> > > > will
> > > > > >> > > produce
> > > > > >> > > > > > your shown results? That way, we can avoid possible
> > > > > >> > > misunderstandings and
> > > > > >> > > > > > also pinpoint the exact parts where people disagree or
> > > > > >> misunderstood
> > > > > >> > > each
> > > > > >> > > > > > other.
> > > > > >> > > > > >
> > > > > >> > > > > > -Marco
> > > > > >> > > > > >
> > > > > >> > > > > > Pedro Larroy <pe...@gmail.com> wrote on Thu, 20 June 2019, 21:47:
> > > > > >> > > > > >
> > > > > >> > > > > > > I can confirm that we are linking with two versions of
> > > > omp,
> > > > > >> I'm
> > > > > >> > > > > > > gaining more clarity into this topic, but I have still
> > > > > >> questions,
> > > > > >> > > the
> > > > > >> > > > > > > facts that I got so far are the following:
> > > > > >> > > > > > >
> > > > > >> > > > > > > * #1: We are linking with two versions of omp,
> > intel's omp
> > > > > >> and llvm
> > > > > >> > > > > > > openmp when building with MKL enabled.
> > > > > >> > > > > > > * #2: We have 3 different possible OMP versions:
> > Intel OMP
> > > > > >> (comes
> > > > > >> > > with
> > > > > >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes
> > with
> > > > gcc)
> > > > > >> (This
> > > > > >> > > > > > > one is used on the PR proposed by Anton).
> > > > > >> > > > > > >
> > > > > >> > > > > > > Questions:
> > > > > >> > > > > > >
> > > > > >> > > > > > >  * #1 Is it ok to have two versions of openmp linked
> > at
> > > > the
> > > > > >> same
> > > > > >> > > time?
> > > > > >> > > > > > >  * #2 Which implementation of OMP gives the best
> > > > > >> performance?  (See
> > > > > >> > > > > > > total training time of my measurement for a partial
> > > > answer)
> > > > > >> > > > > > >  * #3 Should we have a build flag so we can choose
> > the OMP
> > > > > >> version
> > > > > >> > > at
> > > > > >> > > > > > > runtime?
> > > > > >> > > > > > >  * #4 Which Compiler and build flags did Chris use to
> > get
> > > > 10x
> > > > > >> > > slowdown?
> > > > > >> > > > > > >  * #5 @Stas: is there a script to replicate your
> > > > benchmarks
> > > > > >> > > easily? If
> > > > > >> > > > > > > so could you provide a link?  I think we would need to
> > > > > >> reproduce
> > > > > >> > > your
> > > > > >> > > > > > > benchmarks and verify which versions are being linked.
> > > > It's
> > > > > >> > > possible
> > > > > >> > > > > > > that while compiling with MKL intel's omp was pulled
> > in
> > > > > >> instead of
> > > > > >> > > > > > > GNU OpenMP.
> > > > > >> > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's
> > Openmp?
> > > > > >> Should we
> > > > > >> > > > > > > update the subrepo regularly?
> > > > > >> > > > > > >
> > > > > >> > > > > > > My conclusion so far:
> > > > > >> > > > > > >
> > > > > >> > > > > > >  * #1 We should avoid linking two versions of omp if
> > > > possible
> > > > > >> and
> > > > > >> > > > > > > allow users to choose one in the build as we do for
> > BLAS.
> > > > > >> > > > > > >  * #2 For performance reasons and more control vs
> > > > different
> > > > > >> > > compiler
> > > > > >> > > > > > > versions seems it makes indeed sense to keep the LLVM
> > > > OpenMP
> > > > > >> > > version
> > > > > >> > > > > > > in 3rdparty for now. So unless some more data is
> > > > gathered, it
> > > > > >> makes
> > > > > >> > > > > > > sense not to remove it as of now.
> > > > > >> > > > > > >  * #3 We should provide build options to choose which
> > > > openmp
> > > > > >> > > library
> > > > > >> > > > > > > is to be used from the three options available,
> > including
> > > > > >> libgomp.
> > > > > >> > > > > > >  * #4 Refining the build we could also enable OpenMP
> > in
> > > > mac
> > > > > >> without
> > > > > >> > > > > > > additional contortions (doesn't work as of today):
> > > > > >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > > > >> > > > > > >  * #5 We should add different omp versions to our
> > > > benchmarks
> > > > > >> and
> > > > > >> > > track
> > > > > >> > > > > > > the performance, so this data is available for
> > prescribing
> > > > > >> the best
> > > > > >> > > > > > > build options and for binary releases.
> > > > > >> > > > > > >
> > > > > >> > > > > > > This is also an interesting related gh issue posted
> > in the
> > > > > >> mkl-dnn
> > > > > >> > > > > > > repository:
> > https://github.com/intel/mkl-dnn/issues/230
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > I don't observe the order of magnitude divergence
> > > > reported by
> > > > > >> > > Chris in
> > > > > >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full
> > training
> > > > > >> finishes
> > > > > >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM
> > openmp) vs
> > > > > >> libgomp.
> > > > > >> > > > > > >
> > > > > >> > > > > > > There's also differences in training time when using
> > MKL
> > > > and
> > > > > >> the ,
> > > > > >> > > > > > > it's actually a bit slower, I don't know if it's
> > related
> > > > to
> > > > > >> OMP.
> > > > > >> > > > > > >
> > > > > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > > > >> > > > > > >
> > > > > >> > > > > > > Anton's branch:  git@github.com:
> > lebeg/incubator-mxnet.git
> > > > > >>  branch
> > > > > >> > > > > 'omp'
> > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > > > >> > > > > > > build/libmxnet.so |grep -i omp
> > > > > >> > > > > > >         libgomp.so.1 =>
> > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > >> > > > > > > (0x00007fd99a51d000)
> > > > > >> > > > > > >
> > > > > >> > > > > > > time python train_mnist.py
> > > > > >> > > > > > >
> > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed:
> > 41617.00
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=1.000000
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed:
> > 47990.69
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999531
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed:
> > 47517.01
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed:
> > 47430.53
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=1.000000
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed:
> > 47649.77
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed:
> > 51708.12
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed:
> > 57228.63
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999375
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed:
> > 50887.85
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed:
> > 53947.98
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999531
> > > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.219
> > > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > > > >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
> > > > > >> (0avgtext+0avgdata
> > > > > >> > > > > > > 1146052maxresident)k
> > > > > >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults
> > 0swaps
> > > > > >> > > > > > >
> > > > > >> > > > > > > Master, MKL ON:
> > > > > >> > > > > > >
> > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:1:
> > ~/m/e/image-classification
> > > > > >> [master]>
> > > > > >> > > ldd
> > > > > >> > > > > > > ../../build/libmxnet.so | grep -i omp
> > > > > >> > > > > > >         libomp.so =>
> > > > > >> > > > > > >
> > > > > >> > >
> > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > >> > > > > > > (0x00007f05ba38f000)
> > > > > >> > > > > > >         libiomp5.so =>
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > >
> > > > > >> > >
> > > > > >>
> > > >
> > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > > >> > > > > > > (0x00007f05b09f4000)
> > > > > >> > > > > > >
> > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed:
> > 36651.63
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999691
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed:
> > 45093.98
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed:
> > 45146.84
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed:
> > 45119.90
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed:
> > 44998.96
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999531
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed:
> > 45072.25
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed:
> > 44969.79
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed:
> > 44962.78
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed:
> > 44945.47
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999375
> > > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.367
> > > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > > >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> > > > > >> (0avgtext+0avgdata
> > > > > >> > > > > > > 1154348maxresident)k
> > > > > >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults
> > 0swaps
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > MKL OFF:
> > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]>
> > > > grep -i
> > > > > >> MKL
> > > > > >> > > > > > > cmake_options.yml
> > > > > >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > > >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if
> > MKL
> > > > > >> found) IF
> > > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL
> > > > found)
> > > > > >> IF
> > > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]>
> > ldd
> > > > > >> > > > > > > build/libmxnet.so |grep -i omp
> > > > > >> > > > > > >         libomp.so =>
> > > > > >> > > > > > >
> > > > > >> > >
> > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > >> > > > > > > (0x00007fb720c54000)
> > > > > >> > > > > > >
> > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed:
> > 46784.02
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=1.000000
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed:
> > 48824.29
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed:
> > 49190.31
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed:
> > 51518.77
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed:
> > 51551.62
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed:
> > 49026.35
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed:
> > 49002.46
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999375
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed:
> > 48980.55
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999687
> > > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed:
> > 47402.56
> > > > > >> samples/sec
> > > > > >> > > > > > >  accuracy=0.999844
> > > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.259
> > > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > > > >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
> > > > > >> (0avgtext+0avgdata
> > > > > >> > > > > > > 1147008maxresident)k
> > > > > >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults
> > 0swaps
> > > > > >> > > > > > >
> > > > > >> > > > > > > Let me know what you think.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Link to the original PR:
> > > > > >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks.
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > > > > >> > > > > > > <ke...@gmail.com> wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > "if you’re linking in two then you’re doing
> > something
> > > > > >> wrong."
> > > > > >> > > > > Correct,
> > > > > >> > > > > > > > that's one thing I believe we've got consensus on.
> > So
> > > > > >> let's call
> > > > > >> > > > > that
> > > > > >> > > > > > > out
> > > > > >> > > > > > > > as a bug to be fixed.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Let's move forward with some reproducible numbers
> > and
> > > > then
> > > > > >> > > discuss
> > > > > >> > > > > the
> > > > > >> > > > > > > pros
> > > > > >> > > > > > > > / cons of which particular OMP implementation we
> > should
> > > > use.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > > > >> > > > > > > pedro.larroy.lists@gmail.com>
> > > > > >> > > > > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > Hi Chris
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I would ask you to have a bit of patience and
> > help us
> > > > > >> with your
> > > > > >> > > > > > > > > experience in this matter. Nobody is ignoring
> > > > anything, I
> > > > > >> > > think we
> > > > > >> > > > > are
> > > > > >> > > > > > > > > individually gathering feedbacks and trying to
> > > > understand
> > > > > >> the
> > > > > >> > > > > multiple
> > > > > >> > > > > > > > > contributions done to this topic including yours,
> > > > then go
> > > > > >> step
> > > > > >> > > by
> > > > > >> > > > > > > > > step, understand what is going on and run
> > experiments
> > > > and
> > > > > >> > > report
> > > > > >> > > > > back
> > > > > >> > > > > > > > > to the list or the corresponding github item. It
> > was
> > > > > >> suggested
> > > > > >> > > by
> > > > > >> > > > > > > > > Kellen to prepare some containers, this takes
> > effort.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Regarding your final comment, most of us also have
> > > > many
> > > > > >> other
> > > > > >> > > > > things
> > > > > >> > > > > > > > > to do and responsibilities even if our daytime
> > jobs
> > > > might
> > > > > >> > > involve
> > > > > >> > > > > > > > > MXNet in some form or another. I think that's
> > part of
> > > > the
> > > > > >> > > privilege
> > > > > >> > > > > > > > > and responsibility of working close with an open
> > > > source
> > > > > >> > > project and
> > > > > >> > > > > > > > > the magic of collaboration across organizations.
> > Let's
> > > > > >> all be
> > > > > >> > > > > patient
> > > > > >> > > > > > > > > and take some time to understand and reason about
> > this
> > > > > >> topic
> > > > > >> > > which
> > > > > >> > > > > is
> > > > > >> > > > > > > > > not simple. Since we decided to step back and
> > gather
> > > > more
> > > > > >> data
> > > > > >> > > > > let's
> > > > > >> > > > > > > > > take time and do it properly.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Personally I hope to find time to look again into
> > this
> > > > > >> issue
> > > > > >> > > before
> > > > > >> > > > > > > > > the end of the week.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Pedro.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > > > > >> > > > > cjolivier01@apache.org>
> > > > > >> > > > > > > > > wrote:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > if you’re linking in two then you’re doing
> > something
> > > > > >> wrong.
> > > > > >> > > You
> > > > > >> > > > > can
> > > > > >> > > > > > > see
> > > > > >> > > > > > > > > by
> > > > > >> > > > > > > > > > my email yesterday that only one is linked in.
> > This
> > > > is
> > > > > >> also
> > > > > >> > > the
> > > > > >> > > > > case
> > > > > >> > > > > > > with
> > > > > >> > > > > > > > > > the mkl version built by the Makefile — only the
> > > > Intel
> > > > > >> OMP
> > > > > >> > > > > library is
> > > > > >> > > > > > > > > used
> > > > > >> > > > > > > > > > (no libgomp).
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > That being said, Do you have clear evidence that
> > > > using
> > > > > >> Intel
> > > > > >> > > OMP
> > > > > >> > > > > is
> > > > > >> > > > > > > both
> > > > > >> > > > > > > > > > problematic and the situation isn’t fixable?
> > The
> > > > > >> burden of
> > > > > >> > > > > proof is
> > > > > >> > > > > > > on
> > > > > >> > > > > > > > > the
> > > > > >> > > > > > > > > > ones requesting the change — it is not my
> > > > > >> responsibility to
> > > > > >> > > > > justify
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > > current state.  There must be something
> > “terrible”
> > > > and
> > > > > >> > > unfixable
> > > > > >> > > > > to
> > > > > >> > > > > > > > > justify
> > > > > >> > > > > > > > > > a change.  I have seen no proof of this in all
> > this
> > > > > >> time.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > On a side note, I mentioned a couple of things
> > in my
> > > > > >> email
> > > > > >> > > > > yesterday
> > > > > >> > > > > > > that
> > > > > >> > > > > > > > > > still are not being responded to (they were also
> > > > > >> ignored in
> > > > > >> > > the
> > > > > >> > > > > last
> > > > > >> > > > > > > > > > incarnation of this “discussion” — I have much
> > > > > >> experience in
> > > > > >> > > this
> > > > > >> > > > > > > matter
> > > > > >> > > > > > > > > to
> > > > > >> > > > > > > > > > assume “discussion” is a waste of my time,
> > seeing
> > > > and I
> > > > > >> am
> > > > > >> > > not
> > > > > >> > > > > paid
> > > > > >> > > > > > > to
> > > > > >> > > > > > > > > > “work on” mxnet like y’all are).
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > -C
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > >
[quoted text of earlier messages trimmed]

Re: OMP

Posted by Chris Olivier <cj...@gmail.com>.
1) I don't see how that code could cause reentrancy problems in OMP. It
doesn't make any OMP calls at all, so it still doesn't look related to me.
Setting an environment variable probably doesn't even do anything, because:
  a) The OMP runtime probably doesn't check the environment variable except
at initial startup
  b) Even if it did, whether this code ran before or after the OMP init
code would be nondeterministic
  c) It certainly doesn't check the environment variable every time it hits
an omp region. That would be ridiculously expensive, and checking the OMP
source code confirms that it doesn't. You can't affect OMP behavior at
arbitrary points in time by setting the "OMP_NUM_THREADS" environment variable.
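
As an illustration of (a) and (c): a minimal standalone C++ sketch (generic
OpenMP, not MXNet code) showing that with the mainstream runtimes the
OMP_NUM_THREADS variable is only read when the runtime initializes, while the
omp_set_num_threads() API does take effect later. Build with e.g.
"g++ -fopenmp check_env.cpp".

#include <cstdio>
#include <cstdlib>
#include <omp.h>

// Report how many threads a parallel region actually uses.
static int threads_in_region() {
    int n = 0;
    #pragma omp parallel
    {
        #pragma omp single
        n = omp_get_num_threads();
    }
    return n;
}

int main() {
    // First region: the runtime initializes here and reads OMP_NUM_THREADS once.
    std::printf("initial region:                %d threads\n", threads_in_region());

    // Changing the environment variable after initialization has no effect.
    setenv("OMP_NUM_THREADS", "1", 1);
    std::printf("after setenv to 1:             %d threads\n", threads_in_region());

    // The runtime API, by contrast, takes effect at any point.
    omp_set_num_threads(2);
    std::printf("after omp_set_num_threads(2):  %d threads\n", threads_in_region());
    return 0;
}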




On Tue, Jun 25, 2019 at 1:20 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Nobody claimed that the original lockup has to do with OMP, but the
> fix caused re-entrancy into OMP initialization as explained below. So
> I agree with your statement that the bug the pthread_atfork fix was
> addressing is not related to OMP, but the fix itself is causing
> interactions with OMP as described above.
>
> Pedro.
>
> On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier <cj...@gmail.com>
> wrote:
> >
> > The call stacks there are mostly associated with the execution engine
> > threads, which are not OMP threads.  That lockup doesn't look to me to be
> > related to OMP   -- the execution engine uses its own thread pool logic
> --
> > I'm pretty familiar with that part of the code.  Unless I am missing one
> --
> > can you point to the one that looks OMP-related?
> >
> >
> > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > Thanks for digging that out Kellen. That's good info so maybe it would
> > > be good to rework the fix with the info you provided and remove the
> > > pthread_atfork handlers.
> > > Do you think setting the device would avoid the problem seen on the
> > > backtrace you provided?  specifically here:
> > >
> > >
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
> > >
> > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> > > <ke...@gmail.com> wrote:
> > > >
> > > > I remember at the time we also had a read through of this blog post,
> > > > but to me the code looked like it was following the advice:
> > > >
> > >
> https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> > > >
> > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > > > kellen.sunderland@gmail.com> wrote:
> > > >
> > > > > I remember this hang as well, it was pretty hard to reproduce
> IIRC.  I
> > > > > believe the stacks for the hang are here:
> > > > >
> > >
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> > > and
> > > > > the trick was we could only debug it up to the point that we hit:
> > > > >
> > > > > #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> > > > > futex_word=0x7fec60843758)
> > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > > > #1  futex_wait_simple (private=0, expected=1,
> > > futex_word=0x7fec60843758)
> > > > >     at ../sysdeps/nptl/futex-internal.h:135
> > > > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > > > init_routine=0x7fec605f38f0)
> > > > >     at pthread_once.c:105
> > > > > ...
> > > > > #6  0x00007fec6061c577 in cudaSetDevice () from
> > > > > /usr/local/cuda/lib64/libcudart.so.9.0
> > > > >
> > > > > because the code in libcudart is obviously closed source we
> couldn't
> > > dig
> > > > > into what threading work was going on when we called cudaSetDevice.
> > > > >
> > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> If you check initialize.cc we seem to be explicitly disabling that
> > > > >> behaviour in pthread_at_fork which seems to cause thread
> contention
> > > > >> during multiprocessing. Why do we need this major advantage for
> the
> > > > >> library if that's the case?
> > > > >>
> > > > >> Related PRs:
> > > > >>
> > > > >> https://github.com/apache/incubator-mxnet/pull/10820
> > > > >> https://github.com/apache/incubator-mxnet/issues/14396
> > > > >>
> > > > >> The original code was authored in this PR:
> > > > >>
> > > > >> https://github.com/apache/incubator-mxnet/pull/8677
> > > > >>
> > > > >> I actually remember this fix: it was done during a release because the
> > > > >> cuda runtime was forking and the engine was being re-entered. If that
> > > > >> situation is not happening anymore, it might not be needed any longer.
> > > > >> I don't think we know why there was a fork inside cuda, so the code
> > > > >> has grown around a fix for an issue whose root cause was not
> > > > >> understood, and around the side effects which this fix caused afterwards.
> > > > >>
> > > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container
> provided in
> > > > >> the link above, no libgomp.
> > > > >>
> > > > >> I didn't try the Make build.
> > > > >>
> > > > >> I would refactor the code linked above and stop using
> pthread_at_fork,
> > > > >> since OMP assumes it won't be initialized twice, but needs to be
> very
> > > > >> well tested to make sure it doesn't cause bugs or affect the fixes
> > > > >> done on the linked PRs above.
> > > > >>
> > > > >> Pedro.
> > > > >>
> > > > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <
> cjolivier01@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > one major advantage of intel/llvm omp is that it spawns a new
> thread
> > > > >> pool
> > > > >> > after fork if a thread pool was already created. this is so
> that omp
> > > > >> can be
> > > > >> > used in the forked processes. libgomp doesn’t do this so it’ll
> just
> > > > >> lock up
> > > > >> > if you try to do omp in the forked process.
> > > > >> >
> > > > >> > is your build linking libgomp as well?
> > > > >> >
> > > > >> > standard mkl build (from Makefile) uses same omp library. are
> there
> > > > >> > problems with that build?
> > > > >> >
> > > > >> > what changes need to be made to make the assertion not fire?
> > > > >> >
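
To make the fork behaviour easy to check against a given runtime, here is a
minimal standalone C++ sketch (generic OpenMP, not MXNet code): it warms up
the OpenMP runtime in the parent, forks, and then runs a parallel region in
the child. Building it once with "g++ -fopenmp" (libgomp) and once with
"clang++ -fopenmp" (LLVM libomp) is one way to compare the behaviours
described above; whether the child completes or locks up depends on the
runtime linked in.

#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <omp.h>

int main() {
    // Warm up the OpenMP runtime in the parent so its worker threads exist
    // before the fork.
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; ++i) sum += i;
    std::printf("parent: warm-up sum=%d, max threads=%d\n", sum, omp_get_max_threads());

    pid_t pid = fork();
    if (pid == 0) {
        // Child: only the forking thread survives the fork; the runtime's
        // worker threads do not. LLVM/Intel OMP re-creates its pool here,
        // while libgomp has historically been prone to hanging at this point.
        int child_sum = 0;
        #pragma omp parallel for reduction(+:child_sum)
        for (int i = 0; i < 1000; ++i) child_sum += i;
        std::printf("child: sum=%d (parallel region completed)\n", child_sum);
        _exit(0);
    }
    int status = 0;
    waitpid(pid, &status, 0);
    std::printf("parent: child exited with status %d\n", status);
    return 0;
}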
> > > > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> > > > >> pedro.larroy.lists@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > There's an assertion which is easily reproducible, and also
> > > there's a
> > > > >> > > crash including core dump, the latter is not easy to reproduce
> > > for me
> > > > >> > > in different environments. I have also seen mxnet getting
> stuck
> > > > >> > > without progressing with this build configuration and using no
> > > CPU at
> > > > >> > > all when running unit tests.
> > > > >> > >
> > > > >> > > In my view, the root cause of the assertion is that we are
> > > re-entering
> > > > >> > > OMP initialization when spawning threads on the following code
> > > through
> > > > >> > > pthread_at_fork
> > > > >> > >
> > > > >> > >
> > > > >>
> > >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > > > >> > >
> > > > >> > > This causes double initialization of the OMP engine,
> including the
> > > > >> > > assertion which you are asking about,  and I suspect some
> > > additional
> > > > >> > > overhead. That's the shady forking part you are asking for.
> > > > >> > >
> > > > >> > > A question for you: What is the cause of runtime differences
> > > between
> > > > >> > > OMP runtimes? Shouldn't the implementation overhead diminish
> as
> > > > >> > > threads run longer?
> > > > >> > >
> > > > >> > > Pedro.
> > > > >> > >
> > > > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <
> > > cjolivier01@gmail.com>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > What’s the reason for the assertion failure? btw classifying an
> > > > >> > > > assertion failure as a “crash” is debatable. As I stated in the
> > > > >> > > > original issue a long time ago, it’s possible something shady is
> > > > >> > > > being done when forking that should be fixed.  The assertion
> > > > >> > > > should be root caused.
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > > > >> > > pedro.larroy.lists@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Added a dockerfile, and reports of a crash in my local
> machine
> > > > >> when
> > > > >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash
> happened
> > > as
> > > > >> well.
> > > > >> > > > > I couldn't reproduce the crash on my EC2 machine:
> > > > >> > > > > Added the backtrace of the crash as well.
> > > > >> > > > >
> > > > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > > >> > > > >
> > > > >> > > > > Dockerfile here:
> > > > >> > > > >
> > > > >> > > > > https://github.com/larroy/mxnet_omp
> > > > >> > > > >
> > > > >> > > > > Kind regards.
> > > > >> > > > >
> > > > >> > > > > Pedro.
> > > > >> > > > >
> > > > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > > > >> > > marco.g.abreu@gmail.com>
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > As already proposed, I think the easiest way to get a
> common
> > > > >> > > > > understanding
> > > > >> > > > > > is if we start with a few docker containers. Pedro,
> would
> > > it be
> > > > >> > > possible
> > > > >> > > > > > for you to wrap your benchmarks into a few containers
> that
> > > will
> > > > >> > > produce
> > > > >> > > > > > your shown results? That way, we can avoid possible
> > > > >> > > misunderstandings and
> > > > >> > > > > > also pinpoint the exact parts where people disagree or
> > > > >> misunderstood
> > > > >> > > each
> > > > >> > > > > > other.
> > > > >> > > > > >
> > > > >> > > > > > -Marco
> > > > >> > > > > >
> > > > >> > > > > > Pedro Larroy <pe...@gmail.com> schrieb am
> Do.,
> > > > >> 20. Juni
> > > > >> > > > > 2019,
> > > > >> > > > > > 21:47:
> > > > >> > > > > >
> > > > >> > > > > > > I can confirm that we are linking with two versions of
> > > omp,
> > > > >> I'm
> > > > >> > > > > > > gaining more clarity into this topic, but I have still
> > > > >> questions,
> > > > >> > > the
> > > > >> > > > > > > facts that I got so far are the following:
> > > > >> > > > > > >
> > > > >> > > > > > > * #1: We are linking with two versions of omp,
> intel's omp
> > > > >> and llvm
> > > > >> > > > > > > openmp when building with MKL enabled.
> > > > >> > > > > > > * #2: We have 3 different possible OMP versions:
> Intel OMP
> > > > >> (comes
> > > > >> > > with
> > > > >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes
> with
> > > gcc)
> > > > >> (This
> > > > >> > > > > > > one is used on the PR proposed by Anton).
> > > > >> > > > > > >
> > > > >> > > > > > > Questions:
> > > > >> > > > > > >
> > > > >> > > > > > >  * #1 Is it ok to have two versions of openmp linked
> at
> > > the
> > > > >> same
> > > > >> > > time?
> > > > >> > > > > > >  * #2 Which implementation of OMP gives the best
> > > > >> performance?  (See
> > > > >> > > > > > > total training time of my measurement for a partial
> > > answer)
> > > > >> > > > > > >  * #3 Should we have a build flag so we can choose
> the OMP
> > > > >> version
> > > > >> > > at
> > > > >> > > > > > > runtime?
> > > > >> > > > > > >  * #4 Which Compiler and build flags did Chris use to
> get
> > > 10x
> > > > >> > > slowdown?
> > > > >> > > > > > >  * #5 @Stas: is there a script to replicate your
> > > benchmarks
> > > > >> > > easily? If
> > > > >> > > > > > > so could you provide a link?  I think we would need to
> > > > >> reproduce
> > > > >> > > your
> > > > >> > > > > > > benchmarks and verify which versions are being linked.
> > > It's
> > > > >> > > possible
> > > > >> > > > > > > that while compiling with MKL intel's omp was pulled
> in
> > > > >> instead of
> > > > >> > > > > > > GNU OpenMP.
> > > > >> > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's
> Openmp?
> > > > >> Should we
> > > > >> > > > > > > update the subrepo regularly?
> > > > >> > > > > > >
> > > > >> > > > > > > My conclusion so far:
> > > > >> > > > > > >
> > > > >> > > > > > >  * #1 We should avoid linking two versions of omp if
> > > possible
> > > > >> and
> > > > >> > > > > > > allow users to choose one in the build as we do for
> BLAS.
> > > > >> > > > > > >  * #2 For performance reasons, and for more control across
> > > > >> > > > > > > different compiler versions, it indeed makes sense to keep the
> > > > >> > > > > > > LLVM OpenMP version in 3rdparty for now. So unless some more
> > > > >> > > > > > > data is gathered, it makes sense not to remove it as of now.
> > > > >> > > > > > >  * #3 We should provide build options to choose which
> > > openmp
> > > > >> > > library
> > > > >> > > > > > > is to be used from the three options available,
> including
> > > > >> libgomp.
> > > > >> > > > > > >  * #4 Refining the build we could also enable OpenMP
> in
> > > mac
> > > > >> without
> > > > >> > > > > > > additional contortions (doesn't work as of today):
> > > > >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > > >> > > > > > >  * #5 We should add different omp versions to our
> > > benchmarks
> > > > >> and
> > > > >> > > track
> > > > >> > > > > > > the performance, so this data is available for
> prescribing
> > > > >> the best
> > > > >> > > > > > > build options and for binary releases.
> > > > >> > > > > > >
> > > > >> > > > > > > This is also an interesting related gh issue posted
> in the
> > > > >> mkl-dnn
> > > > >> > > > > > > repository:
> https://github.com/intel/mkl-dnn/issues/230
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > I don't observe the order of magnitude divergence
> > > reported by
> > > > >> > > Chris in
> > > > >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full
> training
> > > > >> finishes
> > > > >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM
> openmp) vs
> > > > >> libgomp.
> > > > >> > > > > > >
> > > > >> > > > > > > There are also differences in training time when using
> MKL
> > > and
> > > > >> the ,
> > > > >> > > > > > > it's actually a bit slower, I don't know if it's
> related
> > > to
> > > > >> OMP.
> > > > >> > > > > > >
> > > > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > > >> > > > > > >
> > > > >> > > > > > > Anton's branch:  git@github.com:
> lebeg/incubator-mxnet.git
> > > > >>  branch
> > > > >> > > > > 'omp'
> > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > > >> > > > > > > build/libmxnet.so |grep -i omp
> > > > >> > > > > > >         libgomp.so.1 =>
> > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > >> > > > > > > (0x00007fd99a51d000)
> > > > >> > > > > > >
> > > > >> > > > > > > time python train_mnist.py
> > > > >> > > > > > >
> > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed:
> 41617.00
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=1.000000
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed:
> 47990.69
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999531
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed:
> 47517.01
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed:
> 47430.53
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=1.000000
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed:
> 47649.77
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed:
> 51708.12
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed:
> 57228.63
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999375
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed:
> 50887.85
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed:
> 53947.98
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999531
> > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.219
> > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > > >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
> > > > >> (0avgtext+0avgdata
> > > > >> > > > > > > 1146052maxresident)k
> > > > >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults
> 0swaps
> > > > >> > > > > > >
> > > > >> > > > > > > Master, MKL ON:
> > > > >> > > > > > >
> > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:1:
> ~/m/e/image-classification
> > > > >> [master]>
> > > > >> > > ldd
> > > > >> > > > > > > ../../build/libmxnet.so | grep -i omp
> > > > >> > > > > > >         libomp.so =>
> > > > >> > > > > > >
> > > > >> > >
> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > >> > > > > > > (0x00007f05ba38f000)
> > > > >> > > > > > >         libiomp5.so =>
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > >
> > > > >> > >
> > > > >>
> > >
> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > >> > > > > > > (0x00007f05b09f4000)
> > > > >> > > > > > >
> > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed:
> 36651.63
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999691
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed:
> 45093.98
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed:
> 45146.84
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed:
> 45119.90
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed:
> 44998.96
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999531
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed:
> 45072.25
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed:
> 44969.79
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed:
> 44962.78
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed:
> 44945.47
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999375
> > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.367
> > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> > > > >> (0avgtext+0avgdata
> > > > >> > > > > > > 1154348maxresident)k
> > > > >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults
> 0swaps
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > MKL OFF:
> > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]>
> > > grep -i
> > > > >> MKL
> > > > >> > > > > > > cmake_options.yml
> > > > >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if
> MKL
> > > > >> found) IF
> > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL
> > > found)
> > > > >> IF
> > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]>
> ldd
> > > > >> > > > > > > build/libmxnet.so |grep -i omp
> > > > >> > > > > > >         libomp.so =>
> > > > >> > > > > > >
> > > > >> > >
> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > >> > > > > > > (0x00007fb720c54000)
> > > > >> > > > > > >
> > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed:
> 46784.02
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=1.000000
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed:
> 48824.29
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed:
> 49190.31
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed:
> 51518.77
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed:
> 51551.62
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed:
> 49026.35
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed:
> 49002.46
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999375
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed:
> 48980.55
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999687
> > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed:
> 47402.56
> > > > >> samples/sec
> > > > >> > > > > > >  accuracy=0.999844
> > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.259
> > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > > >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
> > > > >> (0avgtext+0avgdata
> > > > >> > > > > > > 1147008maxresident)k
> > > > >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults
> 0swaps
> > > > >> > > > > > >
> > > > >> > > > > > > Let me know what you think.
> > > > >> > > > > > >
> > > > >> > > > > > > Link to the original PR:
> > > > >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks.
> > > > >> > > > > > >
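
As a runtime complement to the ldd checks above (ldd shows link-time
dependencies, but a second OpenMP runtime can still arrive through dlopen or
a plugin), a small Linux-only C++ sketch that scans /proc/self/maps for
loaded OpenMP runtimes. The "libmxnet.so" name below is an assumption and may
need a full path; build with "g++ -std=c++11 check_loaded_omp.cpp -ldl".

#include <dlfcn.h>
#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    // Pull in libmxnet so that its OpenMP dependencies get mapped as well.
    // Adjust the name/path to wherever libmxnet.so actually lives.
    void* handle = dlopen("libmxnet.so", RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        std::cerr << "note: could not dlopen libmxnet.so: " << dlerror() << "\n";
    }

    // /proc/self/maps lists what is actually mapped into this process.
    std::ifstream maps("/proc/self/maps");
    std::string line;
    std::set<std::string> found;
    const char* runtimes[] = {"libgomp", "libiomp5", "libomp."};
    while (std::getline(maps, line)) {
        for (const char* rt : runtimes) {
            if (line.find(rt) != std::string::npos) {
                found.insert(rt);
            }
        }
    }

    std::cout << "OpenMP runtimes mapped: " << found.size() << "\n";
    for (const std::string& rt : found) {
        std::cout << "  " << rt << "\n";
    }
    if (found.size() > 1) {
        std::cout << "warning: more than one OpenMP runtime in a single process\n";
    }
    return 0;
}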
> > > > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > > > >> > > > > > > <ke...@gmail.com> wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > "if you’re linking in two then you’re doing
> something
> > > > >> wrong."
> > > > >> > > > > Correct,
> > > > >> > > > > > > > that's one thing I believe we've got consensus on.
> So
> > > > >> let's call
> > > > >> > > > > that
> > > > >> > > > > > > out
> > > > >> > > > > > > > as a bug to be fixed.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Let's move forward with some reproducible numbers
> and
> > > then
> > > > >> > > discuss
> > > > >> > > > > the
> > > > >> > > > > > > pros
> > > > >> > > > > > > > / cons of which particular OMP implementation we
> should
> > > use.
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > > >> > > > > > > pedro.larroy.lists@gmail.com>
> > > > >> > > > > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hi Chris
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > I would ask you to have a bit of patience and
> help us
> > > > >> with your
> > > > >> > > > > > > > > experience in this matter. Nobody is ignoring
> > > anything, I
> > > > >> > > think we
> > > > >> > > > > are
> > > > >> > > > > > > > > individually gathering feedback and trying to
> > > understand
> > > > >> the
> > > > >> > > > > multiple
> > > > >> > > > > > > > > contributions done to this topic including yours,
> > > then go
> > > > >> step
> > > > >> > > by
> > > > >> > > > > > > > > step, understand what is going on and run
> experiments
> > > and
> > > > >> > > report
> > > > >> > > > > back
> > > > >> > > > > > > > > to the list or the corresponding github item. It
> was
> > > > >> suggested
> > > > >> > > by
> > > > >> > > > > > > > > Kellen to prepare some containers, this takes
> effort.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Regarding your final comment, most of us also have
> > > many
> > > > >> other
> > > > >> > > > > things
> > > > >> > > > > > > > > to do and responsibilities even if our daytime
> jobs
> > > might
> > > > >> > > involve
> > > > >> > > > > > > > > MXNet in some form or another. I think that's
> part of
> > > the
> > > > >> > > privilege
> > > > >> > > > > > > > > and responsibility of working close with an open
> > > source
> > > > >> > > project and
> > > > >> > > > > > > > > the magic of collaboration across organizations.
> Let's
> > > > >> all be
> > > > >> > > > > patient
> > > > >> > > > > > > > > and take some time to understand and reason about
> this
> > > > >> topic
> > > > >> > > which
> > > > >> > > > > is
> > > > >> > > > > > > > > not simple. Since we decided to step back and
> gather
> > > more
> > > > >> data
> > > > >> > > > > let's
> > > > >> > > > > > > > > take time and do it properly.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Personally I hope to find time to look again into
> this
> > > > >> issue
> > > > >> > > before
> > > > >> > > > > > > > > the end of the week.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Pedro.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > > > >> > > > > cjolivier01@apache.org>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > if you’re linking in two then you’re doing
> something
> > > > >> wrong.
> > > > >> > > You
> > > > >> > > > > can
> > > > >> > > > > > > see
> > > > >> > > > > > > > > by
> > > > >> > > > > > > > > > my email yesterday that only one is linked in.
> This
> > > is
> > > > >> also
> > > > >> > > the
> > > > >> > > > > case
> > > > >> > > > > > > with
> > > > >> > > > > > > > > > the mkl version built by the Makefile — only the
> > > Intel
> > > > >> OMP
> > > > >> > > > > library is
> > > > >> > > > > > > > > used
> > > > >> > > > > > > > > > (no libgomp).
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > That being said, do you have clear evidence that
> > > using
> > > > >> Intel
> > > > >> > > OMP
> > > > >> > > > > is
> > > > >> > > > > > > both
> > > > >> > > > > > > > > > problematic and the situation isn’t fixable?
> The
> > > > >> burden of
> > > > >> > > > > proof is
> > > > >> > > > > > > on
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > ones requesting the change — it is not my
> > > > >> responsibility to
> > > > >> > > > > justify
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > current state.  There must be something
> “terrible”
> > > and
> > > > >> > > unfixable
> > > > >> > > > > to
> > > > >> > > > > > > > > justify
> > > > >> > > > > > > > > > a change.  I have seen no proof of this in all
> this
> > > > >> time.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On a side note, I mentioned a couple of things
> in my
> > > > >> email
> > > > >> > > > > yesterday
> > > > >> > > > > > > that
> > > > >> > > > > > > > > > still are not being responded to (they were also ignored
> > > > >> > > > > > > > > > in the last incarnation of this “discussion” — I have
> > > > >> > > > > > > > > > enough experience in this matter to assume “discussion”
> > > > >> > > > > > > > > > is a waste of my time, seeing as I am not paid to “work
> > > > >> > > > > > > > > > on” mxnet like y’all are).
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > -C
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > >
[quoted text of earlier messages trimmed]

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
Nobody claimed that the original lockup was caused by OMP; the point is
that the fix introduced re-entrancy into OMP initialization, as explained
below. So I agree with your statement that the bug the pthread_atfork
handlers were fixing is not related to OMP, but the fix itself is what
interacts with OMP, as described above.
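
To make the re-entrancy concrete, here is a minimal standalone sketch of
the pattern (illustrative only, not the actual initialize.cc code; the
handler name is made up):

// Sketch: a child-side pthread_atfork handler that rebuilds engine state
// and touches an OMP region runs OMP initialization a second time in the
// forked child.
#include <omp.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static int omp_regions_entered = 0;

static void restart_engine_in_child() {  // hypothetical handler
  #pragma omp parallel
  { }                                    // OMP runtime re-initializes here
  ++omp_regions_entered;
}

int main() {
  pthread_atfork(nullptr, nullptr, restart_engine_in_child);
  #pragma omp parallel
  { }                                    // first initialization, in the parent
  ++omp_regions_entered;
  pid_t pid = fork();                    // the handler fires inside the child
  if (pid == 0) {
    std::printf("child: OMP regions entered %d times\n", omp_regions_entered);
    _exit(0);
  }
  waitpid(pid, nullptr, 0);
  return 0;
}

The child prints 2: once for the parent's region it inherited and once for
the region run by the atfork handler, which is the double initialization
I'm referring to.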

Pedro.

On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier <cj...@gmail.com> wrote:
>
> The call stacks there are mostly associated with the execution engine
> threads, which are not OMP threads.  That lockup doesn't look to me to be
> related to OMP   -- the execution engine uses its own thread pool logic --
> I'm pretty familiar with that part of the code.  Unless I am missing one --
> can you point to the one that looks OMP-related?
>
>
> On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > Thanks for digging that out Kellen. That's good info so maybe it would
> > be good to rework the fix with the info you provided and remove the
> > pthread_atfork handlers.
> > Do you think setting the device would avoid the problem seen on the
> > backtrace you provided?  specifically here:
> >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
> >
> > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> > <ke...@gmail.com> wrote:
> > >
> > > I remember at the time we also had a read through of this blog post, but
> > to
> > > us the code looked like it was following the advice:
> > >
> > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> > >
> > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > > kellen.sunderland@gmail.com> wrote:
> > >
> > > > I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> > > > believe the stacks for the hang are here:
> > > >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> > and
> > > > the trick was we could only debug it up to the point that we hit:
> > > >
> > > > #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> > > > futex_word=0x7fec60843758)
> > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > > #1  futex_wait_simple (private=0, expected=1,
> > futex_word=0x7fec60843758)
> > > >     at ../sysdeps/nptl/futex-internal.h:135
> > > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > > init_routine=0x7fec605f38f0)
> > > >     at pthread_once.c:105
> > > > ...
> > > > #6  0x00007fec6061c577 in cudaSetDevice () from
> > > > /usr/local/cuda/lib64/libcudart.so.9.0
> > > >
> > > > because the code in libcudart is obviously closed source we couldn't
> > dig
> > > > into what threading work was going on when we called cudaSetDevice.
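
For what it's worth, that futex_wait inside __pthread_once_slow is the
classic signature of a pthread_once that was still in flight on another
thread at the moment of the fork. A tiny standalone repro of just that
mechanism (hypothetical, nothing to do with libcudart's internals) looks
like this:

// Fork while another thread is inside pthread_once, then call the same
// pthread_once in the child: the child inherits the "in progress" state
// but not the thread that will complete it, so it waits forever.
#include <pthread.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <thread>

static pthread_once_t once = PTHREAD_ONCE_INIT;

static void slow_init() {
  std::this_thread::sleep_for(std::chrono::seconds(2));  // init in progress
}

int main() {
  std::thread t([] { pthread_once(&once, slow_init); });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  if (fork() == 0) {
    pthread_once(&once, slow_init);   // blocks in __pthread_once_slow forever
    std::printf("never reached\n");
    _exit(0);
  }
  t.join();
  std::printf("parent done; the child is still stuck\n");
  return 0;
}

Whether that is what actually happens inside cudaSetDevice is of course
speculation, since we can't see into libcudart.
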
> > > >
> > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> > > > wrote:
> > > >
> > > >> If you check initialize.cc we seem to be explicitly disabling that
> > > >> behaviour in pthread_at_fork which seems to cause thread contention
> > > >> during multiprocessing. Why do we need this major advantage for the
> > > >> library if that's the case?
> > > >>
> > > >> Related PRs:
> > > >>
> > > >> https://github.com/apache/incubator-mxnet/pull/10820
> > > >> https://github.com/apache/incubator-mxnet/issues/14396
> > > >>
> > > >> The original code was authored in this PR:
> > > >>
> > > >> https://github.com/apache/incubator-mxnet/pull/8677
> > > >>
> > > >> I actually remember this fix, it was done during a release as the cuda
> > > >> runtime was forking and the engine was being re-entered. If that
> > > >> situation is not happening anymore, it might not be needed any longer.
> > > >> I don't think we know the cause why there was a fork inside cuda, so
> > > >> the code has grown around a fix for an issue which its root cause was
> > > >> not understood, and side effects which this fix caused afterwards.
> > > >>
> > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> > > >> the link above, no libgomp.
> > > >>
> > > >> I didn't try the Make build.
> > > >>
> > > >> I would refactor the code linked above and stop using pthread_at_fork,
> > > >> since OMP assumes it won't be initialized twice, but needs to be very
> > > >> well tested to make sure it doesn't cause bugs or affect the fixes
> > > >> done on the linked PRs above.
> > > >>
> > > >> Pedro.
> > > >>
> > > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cj...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > one major advantage of intel/llvm omp is that it spawns a new thread
> > > >> pool
> > > >> > after fork if a thread pool was already created. this is so that omp
> > > >> can be
> > > >> > used in the forked processes. libgomp doesn’t do this so it’ll just
> > > >> lock up
> > > >> > if you try to do omp in the forked process.
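
A minimal probe of this difference (hypothetical code, not from MXNet) is
just: run an OMP region, fork, and run another OMP region in the child.

// With LLVM/Intel OMP the child respawns its worker pool and the second
// region runs; with libgomp this is the case that has been reported to
// lock up.
#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  #pragma omp parallel
  { }                                 // parent creates its OMP worker pool
  pid_t pid = fork();
  if (pid == 0) {
    #pragma omp parallel
    {
      std::printf("child thread %d alive\n", omp_get_thread_num());
    }
    _exit(0);
  }
  waitpid(pid, nullptr, 0);
  return 0;
}
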
> > > >> >
> > > >> > is your build linking libgomp as well?
> > > >> >
> > > >> > standard mkl build (from Makefile) uses same omp library. are there
> > > >> > problems with that build?
> > > >> >
> > > >> > what changes need to be made to make the assertion not fire?
> > > >> >
> > > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> > > >> pedro.larroy.lists@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > There's an assertion which is easily reproducible, and also
> > there's a
> > > >> > > crash including core dump, the latter is not easy to reproduce
> > for me
> > > >> > > in different environments. I have also seen mxnet getting stuck
> > > >> > > without progressing with this build configuration and using no
> > CPU at
> > > >> > > all when running unit tests.
> > > >> > >
> > > >> > > In my view, the root cause of the assertion is that we are
> > re-entering
> > > >> > > OMP initialization when spawning threads on the following code
> > through
> > > >> > > pthread_at_fork
> > > >> > >
> > > >> > >
> > > >>
> > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > > >> > >
> > > >> > > This causes double initialization of the OMP engine, including the
> > > >> > > assertion which you are asking about,  and I suspect some
> > additional
> > > >> > > overhead. That's the shady forking part you are asking for.
> > > >> > >
> > > >> > > A question for you: What is the cause of runtime differences
> > between
> > > >> > > OMP runtimes? Shouldn't the implementation overhead diminish as
> > > >> > > threads run longer?
> > > >> > >
> > > >> > > Pedro.
> > > >> > >
> > > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <
> > cjolivier01@gmail.com>
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > What’s the reason for the assertion failure? btw classifying an
> > > >> assertion
> > > >> > > > failure a “crash” is debatable. As I stated in the original
> > issue a
> > > >> long
> > > >> > > > time ago, it’s possible something shady is being done when
> > > >> forking
> > > >> > > > that should be fixed.  The assertion should be root caused.
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > > >> > > pedro.larroy.lists@gmail.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Added a dockerfile, and reports of a crash in my local machine
> > > >> when
> > > >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened
> > as
> > > >> well.
> > > >> > > > > I couldn't reproduce the crash on my EC2 machine:
> > > >> > > > > Added the backtrace of the crash as well.
> > > >> > > > >
> > > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > >> > > > >
> > > >> > > > > Dockerfile here:
> > > >> > > > >
> > > >> > > > > https://github.com/larroy/mxnet_omp
> > > >> > > > >
> > > >> > > > > Kind regards.
> > > >> > > > >
> > > >> > > > > Pedro.
> > > >> > > > >
> > > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > > >> > > marco.g.abreu@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > As already proposed, I think the easiest way to get a common
> > > >> > > > > understanding
> > > >> > > > > > is if we start with a few docker containers. Pedro, would
> > it be
> > > >> > > possible
> > > >> > > > > > for you to wrap your benchmarks into a few containers that
> > will
> > > >> > > produce
> > > >> > > > > > your shown results? That way, we can avoid possible
> > > >> > > misunderstandings and
> > > >> > > > > > also pinpoint the exact parts where people disagree or
> > > >> misunderstood
> > > >> > > each
> > > >> > > > > > other.
> > > >> > > > > >
> > > >> > > > > > -Marco
> > > >> > > > > >
> > > >> > > > > > Pedro Larroy <pe...@gmail.com> schrieb am Do.,
> > > >> 20. Juni
> > > >> > > > > 2019,
> > > >> > > > > > 21:47:
> > > >> > > > > >
> > > >> > > > > > > I can confirm that we are linking with two versions of
> > omp,
> > > >> I'm
> > > >> > > > > > > gaining more clarity into this topic, but I have still
> > > >> questions,
> > > >> > > the
> > > >> > > > > > > facts that I got so far are the folllowing:
> > > >> > > > > > >
> > > >> > > > > > > * #1: We are linking with two versions of omp, intel's omp
> > > >> and llvm
> > > >> > > > > > > openmp when building with MKL enabled.
> > > >> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP
> > > >> (comes
> > > >> > > with
> > > >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with
> > gcc)
> > > >> (This
> > > >> > > > > > > one is used on the PR proposed by Anton).
> > > >> > > > > > >
> > > >> > > > > > > Questions:
> > > >> > > > > > >
> > > >> > > > > > >  * #1 Is it ok to have two versions of openmp linked at
> > the
> > > >> same
> > > >> > > time?
> > > >> > > > > > >  * #2 Which implementation of OMP gives the best
> > > >> performance?  (See
> > > >> > > > > > > total training time of my measurement for a partial
> > answer)
> > > >> > > > > > >  * #3 Should we have a build flag so we can choose the OMP
> > > >> version
> > > >> > > at
> > > >> > > > > > > runtime?
> > > >> > > > > > >  * #4 Which Compiler and build flags did Chris use to get
> > 10x
> > > >> > > slowdown?
> > > >> > > > > > >  * #5 @Stas: is there a script to replicate your
> > benchmarks
> > > >> > > easily? If
> > > >> > > > > > > so could you provide a link?  I think we would need to
> > > >> reproduce
> > > >> > > your
> > > >> > > > > > > benchmarks and verify which versions are being linked.
> > It's
> > > >> > > possible
> > > >> > > > > > > that while compiling with MKL intel's omp was pulled in
> > > >> instead of
> > > >> > > > > > > GNU OpenMP.
> > > >> > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp?
> > > >> Should we
> > > >> > > > > > > update the subrepo regularly?
> > > >> > > > > > >
> > > >> > > > > > > My conclusion so far:
> > > >> > > > > > >
> > > >> > > > > > >  * #1 We should avoid linking two versions of omp if
> > possible
> > > >> and
> > > >> > > > > > > allow users to choose one in the build as we do for BLAS.
> > > >> > > > > > >  * #2 For performance reasons and more control vs
> > different
> > > >> > > compiler
> > > >> > > > > > > versions seems it makes indeed sense to keep the LLVM
> > OpenMP
> > > >> > > version
> > > >> > > > > > > in 3rdparty for now. So unless some more data is
> > gathered, it
> > > >> makes
> > > >> > > > > > > sense not to remove it as of now.
> > > >> > > > > > >  * #3 We should provide build options to choose which
> > openmp
> > > >> > > library
> > > >> > > > > > > is to be used from the three options available, including
> > > >> libgomp.
> > > >> > > > > > >  * #4 Refining the build we could also enable OpenMP in
> > mac
> > > >> without
> > > >> > > > > > > additional contortions (doesn't work as of today):
> > > >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > >> > > > > > >  * #5 We should add different omp versions to our
> > benchmarks
> > > >> and
> > > >> > > track
> > > >> > > > > > > the performance, so this data is available for prescribing
> > > >> the best
> > > >> > > > > > > build options and for binary releases.
> > > >> > > > > > >
> > > >> > > > > > > This is also an interesting related gh issue posted in the
> > > >> mkl-dnn
> > > >> > > > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > I don't observe the order of magnitude divergence
> > reported by
> > > >> > > Chris in
> > > >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training
> > > >> finishes
> > > >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs
> > > >> libgomp.
> > > >> > > > > > >
> > > >> > > > > > > There's also differences in training time when using MKL
> > and
> > > >> the ,
> > > >> > > > > > > it's actually a bit slower, I don't know if it's related
> > to
> > > >> OMP.
> > > >> > > > > > >
> > > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > >> > > > > > >
> > > >> > > > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git
> > > >>  branch
> > > >> > > > > 'omp'
> > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > >> > > > > > > build/libmxnet.so |grep -i omp
> > > >> > > > > > >         libgomp.so.1 =>
> > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > >> > > > > > > (0x00007fd99a51d000)
> > > >> > > > > > >
> > > >> > > > > > > time python train_mnist.py
> > > >> > > > > > >
> > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00
> > > >> samples/sec
> > > >> > > > > > >  accuracy=1.000000
> > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999531
> > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53
> > > >> samples/sec
> > > >> > > > > > >  accuracy=1.000000
> > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999375
> > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999531
> > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.219
> > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
> > > >> (0avgtext+0avgdata
> > > >> > > > > > > 1146052maxresident)k
> > > >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> > > >> > > > > > >
> > > >> > > > > > > Master, MKL ON:
> > > >> > > > > > >
> > > >> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification
> > > >> [master]>
> > > >> > > ldd
> > > >> > > > > > > ../../build/libmxnet.so | grep -i omp
> > > >> > > > > > >         libomp.so =>
> > > >> > > > > > >
> > > >> > >
> > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > >> > > > > > > (0x00007f05ba38f000)
> > > >> > > > > > >         libiomp5.so =>
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > >
> > > >> > >
> > > >>
> > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > >> > > > > > > (0x00007f05b09f4000)
> > > >> > > > > > >
> > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999691
> > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999531
> > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999375
> > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.367
> > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> > > >> (0avgtext+0avgdata
> > > >> > > > > > > 1154348maxresident)k
> > > >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > MKL OFF:
> > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]>
> > grep -i
> > > >> MKL
> > > >> > > > > > > cmake_options.yml
> > > >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL
> > > >> found) IF
> > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL
> > found)
> > > >> IF
> > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > > >> > > > > > > build/libmxnet.so |grep -i omp
> > > >> > > > > > >         libomp.so =>
> > > >> > > > > > >
> > > >> > >
> > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > >> > > > > > > (0x00007fb720c54000)
> > > >> > > > > > >
> > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02
> > > >> samples/sec
> > > >> > > > > > >  accuracy=1.000000
> > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999375
> > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999687
> > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56
> > > >> samples/sec
> > > >> > > > > > >  accuracy=0.999844
> > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.259
> > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
> > > >> (0avgtext+0avgdata
> > > >> > > > > > > 1147008maxresident)k
> > > >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > > >> > > > > > >
> > > >> > > > > > > Let me know what you think.
> > > >> > > > > > >
> > > >> > > > > > > Link to the original PR:
> > > >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > >> > > > > > >
> > > >> > > > > > > Thanks.
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > > >> > > > > > > <ke...@gmail.com> wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > "if you’re linking in two then you’re doing something
> > > >> wrong."
> > > >> > > > > Correct,
> > > >> > > > > > > > that's one thing I believe we've got consensus on.  So
> > > >> let's call
> > > >> > > > > that
> > > >> > > > > > > out
> > > >> > > > > > > > as a bug to be fixed.
> > > >> > > > > > > >
> > > >> > > > > > > > Let's move forward with some reproducible numbers and
> > then
> > > >> > > discuss
> > > >> > > > > the
> > > >> > > > > > > pros
> > > >> > > > > > > > / cons of which particular OMP implementation we should
> > use.
> > > >> > > > > > > >
> > > >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > >> > > > > > > pedro.larroy.lists@gmail.com>
> > > >> > > > > > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi Chris
> > > >> > > > > > > > >
> > > >> > > > > > > > > I would ask you to have a bit of patience and help us
> > > >> with your
> > > >> > > > > > > > > experience in this matter. Nobody is ignoring
> > anything, I
> > > >> > > think we
> > > >> > > > > are
> > > >> > > > > > > > > individually gathering feedbacks and trying to
> > understand
> > > >> the
> > > >> > > > > multiple
> > > >> > > > > > > > > contributions done to this topic including yours,
> > then go
> > > >> step
> > > >> > > by
> > > >> > > > > > > > > step, understand what is going on and run experiments
> > and
> > > >> > > report
> > > >> > > > > back
> > > >> > > > > > > > > to the list or the corresponding github item. It was
> > > >> suggested
> > > >> > > by
> > > >> > > > > > > > > Kellen to prepare some containers, this takes effort.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Regarding your final comment, most of us also have
> > many
> > > >> other
> > > >> > > > > things
> > > >> > > > > > > > > to do and responsibilities even if our daytime jobs
> > might
> > > >> > > involve
> > > >> > > > > > > > > MXNet in some form or another. I think that's part of
> > the
> > > >> > > privilege
> > > >> > > > > > > > > and responsibility of working close with an open
> > source
> > > >> > > project and
> > > >> > > > > > > > > the magic of collaboration across organizations. Let's
> > > >> all be
> > > >> > > > > patient
> > > >> > > > > > > > > and take some time to understand and reason about this
> > > >> topic
> > > >> > > which
> > > >> > > > > is
> > > >> > > > > > > > > not simple. Since we decided to step back and gather
> > more
> > > >> data
> > > >> > > > > let's
> > > >> > > > > > > > > take time and do it properly.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Personally I hope to find time to look again into this
> > > >> issue
> > > >> > > before
> > > >> > > > > > > > > the end of the week.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Thanks.
> > > >> > > > > > > > >
> > > >> > > > > > > > > Pedro.
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > > >> > > > > cjolivier01@apache.org>
> > > >> > > > > > > > > wrote:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > if you’re linking in two then you’re doing something
> > > >> wrong.
> > > >> > > You
> > > >> > > > > can
> > > >> > > > > > > see
> > > >> > > > > > > > > by
> > > >> > > > > > > > > > my email yesterday that only one is linked in. This
> > is
> > > >> also
> > > >> > > the
> > > >> > > > > case
> > > >> > > > > > > with
> > > >> > > > > > > > > > the mkl version built by the Makefile — only the
> > Intel
> > > >> OMP
> > > >> > > > > library is
> > > >> > > > > > > > > used
> > > >> > > > > > > > > > (no libgomp).
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > That being said, Do you have clear evidence that
> > using
> > > >> Intel
> > > >> > > OMP
> > > >> > > > > is
> > > >> > > > > > > both
> > > >> > > > > > > > > > problematic and the situation isn’t fixable?  The
> > > >> burden of
> > > >> > > > > proof is
> > > >> > > > > > > on
> > > >> > > > > > > > > the
> > > >> > > > > > > > > > ones requesting the change — it is not my
> > > >> responsibility to
> > > >> > > > > justify
> > > >> > > > > > > the
> > > >> > > > > > > > > > current state.  There must be something “terrible”
> > and
> > > >> > > unfixable
> > > >> > > > > to
> > > >> > > > > > > > > justify
> > > >> > > > > > > > > > a change.  I have seen no proof of this in all this
> > > >> time.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > On a side note, I mentioned a couple of things in my
> > > >> email
> > > >> > > > > yesterday
> > > >> > > > > > > that
> > > >> > > > > > > > > > still are not being responded to (they were also
> > > >> ignored in
> > > >> > > the
> > > >> > > > > last
> > > >> > > > > > > > > > incarnation of this “discussion” — I have much
> > > >> experience in
> > > >> > > this
> > > >> > > > > > > matter
> > > >> > > > > > > > > to
> > > >> > > > > > > > > > assume “discussion” is a waste of my time, seeing
> > and I
> > > >> am
> > > >> > > not
> > > >> > > > > paid
> > > >> > > > > > > to
> > > >> > > > > > > > > > “work on” mxnet like y’all are).
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > -C
> > > >> > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > > >> > > > > > > > > > kellen.sunderland@gmail.com> wrote:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > > I've also quite often seen two versions of OpenMP
> > > >> linked.
> > > >> > > I
> > > >> > > > > think
> > > >> > > > > > > we
> > > >> > > > > > > > > can
> > > >> > > > > > > > > > > all agree we probably want to avoid linking in two
> > > >> > > libraries
> > > >> > > > > that
> > > >> > > > > > > do
> > > >> > > > > > > > > > > effectively the same thing.
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > The performance questions should be fairly
> > straight
> > > >> > > forward to
> > > >> > > > > > > > > demonstrate
> > > >> > > > > > > > > > > right?  Could we just collaborate on a few minimal
> > > >> > > Dockerfiles
> > > >> > > > > that
> > > >> > > > > > > > > show
> > > >> > > > > > > > > > > (or don't show) Intel OpenMP performance speedups
> > > >> with the
> > > >> > > > > > > workloads
> > > >> > > > > > > > > Chris
> > > >> > > > > > > > > > > is referencing?
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov,
> > Stanislav <
> > > >> > > > > > > > > > > stanislav.tsukrov@gmail.com> wrote:
> > > >> > > > > > > > > > >
> > > >> > > > > > > > > > > > Hi, Chris!
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Stas here - I've gathered that performance data.
> > > >> > > > > > > > > > > > Sure thing, I can be wrong, but please
> > elaborate a
> > > >> bit on
> > > >> > > > > what
> > > >> > > > > > > we are
> > > >> > > > > > > > > > > > missing.
> > > >> > > > > > > > > > > > Be assured, intentional misdirection was never a
> > > >> case.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Thanks a lot for being constructive.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well,
> > > >> since it
> > > >> > > > > tends
> > > >> > > > > > > to
> > > >> > > > > > > > > pull
> > > >> > > > > > > > > > > in
> > > >> > > > > > > > > > > > omp, depending which one is linked in).
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > We never ever considered turning MKL off. We
> > are on
> > > >> the
> > > >> > > same
> > > >> > > > > page
> > > >> > > > > > > > > here -
> > > >> > > > > > > > > > > > MKL is crucial for the performance.
> > > >> > > > > > > > > > > > Why should we? There's a GOMP-linked version of
> > MKL,
> > > >> > > that we
> > > >> > > > > can
> > > >> > > > > > > use.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > What we did - we measured, if using compilers
> > > >> default
> > > >> > > OpenMP
> > > >> > > > > > > > > > > > implementation instead of referenced source code
> > > >> > > > > distribution of
> > > >> > > > > > > > > OpenMP
> > > >> > > > > > > > > > > > makes anything slower.
> > > >> > > > > > > > > > > > We have found the impact to be hardly
> > measurable.
> > > >> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on
> > our
> > > >> > > > > benchmarks,
> > > >> > > > > > > most
> > > >> > > > > > > > > of
> > > >> > > > > > > > > > > the
> > > >> > > > > > > > > > > > time less than that.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > We just suggest to simplify the build of mxnet,
> > by
> > > >> > > removing
> > > >> > > > > the
> > > >> > > > > > > > > > > > unnecessary dependency.
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > During that we discovered for example the
> > following
> > > >> > > amazing
> > > >> > > > > > > issue:
> > > >> > > > > > > > > > > >
> > > >> https://github.com/apache/incubator-mxnet/issues/14087
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Best Regards
> > > >> > > > > > > > > > > >
> > > >> > > > > > > > > > > > Stas


Re: OMP

Posted by Chris Olivier <cj...@gmail.com>.
The call stacks there are mostly associated with the execution engine
threads, which are not OMP threads.  That lockup doesn't look to me to be
related to OMP   -- the execution engine uses its own thread pool logic --
I'm pretty familiar with that part of the code.  Unless I am missing one --
can you point to the one that looks OMP-related?
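
To be concrete about the distinction, engine-style dispatch is just plain
std::thread workers parked on a condition variable, roughly like the
sketch below (illustrative only, not the actual engine code):

// A bare-bones worker pool: threads block in a condition-variable wait and
// pop tasks from a queue. None of this goes through the OpenMP runtime.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
 public:
  explicit WorkerPool(size_t n) {
    for (size_t i = 0; i < n; ++i)
      workers_.emplace_back([this] { Run(); });
  }
  ~WorkerPool() {
    { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
  void Push(std::function<void()> task) {
    { std::lock_guard<std::mutex> lk(mu_); tasks_.push(std::move(task)); }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (done_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // runs on a plain std::thread, not an OMP worker
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::vector<std::thread> workers_;
};

A stack dump of workers like these shows condition-variable wait frames
(pthread_cond_wait and friends) with nothing from libomp or libgomp on the
stack, which is why those call stacks don't point at OMP.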


On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <pe...@gmail.com>
wrote:

> Thanks for digging that out Kellen. That's good info so maybe it would
> be good to rework the fix with the info you provided and remove the
> pthread_atfork handlers.
> Do you think setting the device would avoid the problem seen on the
> backtrace you provided?  specifically here:
>
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
>
> On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> <ke...@gmail.com> wrote:
> >
> > I remember at the time we also had a read through of this blog post, but
> to
> > us the code looked like it was following the advice:
> >
> https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> >
> > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > kellen.sunderland@gmail.com> wrote:
> >
> > > I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> > > believe the stacks for the hang are here:
> > >
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> and
> > > the trick was we could only debug it up to the point that we hit:
> > >
> > > #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> > > futex_word=0x7fec60843758)
> > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > #1  futex_wait_simple (private=0, expected=1,
> futex_word=0x7fec60843758)
> > >     at ../sysdeps/nptl/futex-internal.h:135
> > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > init_routine=0x7fec605f38f0)
> > >     at pthread_once.c:105
> > > ...
> > > #6  0x00007fec6061c577 in cudaSetDevice () from
> > > /usr/local/cuda/lib64/libcudart.so.9.0
> > >
> > > because the code in libcudart is obviously closed source we couldn't
> dig
> > > into what threading work was going on when we called cudaSetDevice.
> > >
> > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > >> If you check initialize.cc we seem to be explicitly disabling that
> > >> behaviour in pthread_at_fork which seems to cause thread contention
> > >> during multiprocessing. Why do we need this major advantage for the
> > >> library if that's the case?
> > >>
> > >> Related PRs:
> > >>
> > >> https://github.com/apache/incubator-mxnet/pull/10820
> > >> https://github.com/apache/incubator-mxnet/issues/14396
> > >>
> > >> The original code was authored in this PR:
> > >>
> > >> https://github.com/apache/incubator-mxnet/pull/8677
> > >>
> > >> I actually remember this fix, it was done during a release as the cuda
> > >> runtime was forking and the engine was being re-entered. If that
> > >> situation is not happening anymore, it might not be needed any longer.
> > >> I don't think we know the cause why there was a fork inside cuda, so
> > >> the code has grown around a fix for an issue which its root cause was
> > >> not understood, and side effects which this fix caused afterwards.
> > >>
> > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> > >> the link above, no libgomp.
> > >>
> > >> I didn't try the Make build.
> > >>
> > >> I would refactor the code linked above and stop using pthread_at_fork,
> > >> since OMP assumes it won't be initialized twice, but needs to be very
> > >> well tested to make sure it doesn't cause bugs or affect the fixes
> > >> done on the linked PRs above.
> > >>
> > >> Pedro.
> > >>
> > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cj...@gmail.com>
> > >> wrote:
> > >> >
> > >> > one major advantage of intel/llvm omp is that it spawns a new thread
> > >> pool
> > >> > after fork if a thread pool was already created. this is so that omp
> > >> can be
> > >> > used in the forked processes. libgomp doesn’t do this so it’ll just
> > >> lock up
> > >> > if you try to do omp in the forked process.
> > >> >
> > >> > is your build linking libgomp as well?
> > >> >
> > >> > standard mkl build (from Makefile) uses same omp library. are there
> > >> > problems with that build?
> > >> >
> > >> > what changes need to be made to make the assertion not fire?
> > >> >
> > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> > >> pedro.larroy.lists@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > There's an assertion which is easily reproducible, and also
> there's a
> > >> > > crash including core dump, the latter is not easy to reproduce
> for me
> > >> > > in different environments. I have also seen mxnet getting stuck
> > >> > > without progressing with this build configuration and using no
> CPU at
> > >> > > all when running unit tests.
> > >> > >
> > >> > > In my view, the root cause of the assertion is that we are
> re-entering
> > >> > > OMP initialization when spawning threads on the following code
> through
> > >> > > pthread_at_fork
> > >> > >
> > >> > >
> > >>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > >> > >
> > >> > > This causes double initialization of the OMP engine, including the
> > >> > > assertion which you are asking about,  and I suspect some
> additional
> > >> > > overhead. That's the shady forking part you are asking for.
> > >> > >
> > >> > > A question for you: What is the cause of runtime differences
> between
> > >> > > OMP runtimes? Shouldn't the implementation overhead diminish as
> > >> > > threads run longer?
> > >> > >
> > >> > > Pedro.
> > >> > >
> > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <
> cjolivier01@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > What’s the reason for the assertion failure? btw classifying an
> > >> assertion
> > >> > > > failure a “crash” is debatable. As I stated in the original
> issue a
> > >> long
> > >> > > > time ago, it’s possible something shady is being done when
> > >> forking
> > >> > > > that should be fixed.  The assertion should be root caused.
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > >> > > pedro.larroy.lists@gmail.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Added a dockerfile, and reports of a crash in my local machine
> > >> when
> > >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened
> as
> > >> well.
> > >> > > > > I couldn't reproduce the crash on my EC2 machine:
> > >> > > > > Added the backtrace of the crash as well.
> > >> > > > >
> > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856
> > >> > > > >
> > >> > > > > Dockerfile here:
> > >> > > > >
> > >> > > > > https://github.com/larroy/mxnet_omp
> > >> > > > >
> > >> > > > > Kind regards.
> > >> > > > >
> > >> > > > > Pedro.
> > >> > > > >
> > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > >> > > marco.g.abreu@gmail.com>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > As already proposed, I think the easiest way to get a common
> > >> > > > > understanding
> > >> > > > > > is if we start with a few docker containers. Pedro, would
> it be
> > >> > > possible
> > >> > > > > > for you to wrap your benchmarks into a few containers that
> will
> > >> > > produce
> > >> > > > > > your shown results? That way, we can avoid possible
> > >> > > misunderstandings and
> > >> > > > > > also pinpoint the exact parts where people disagree or
> > >> misunderstood
> > >> > > each
> > >> > > > > > other.
> > >> > > > > >
> > >> > > > > > -Marco
> > >> > > > > >
> > >> > > > > > Pedro Larroy <pe...@gmail.com> schrieb am Do.,
> > >> 20. Juni
> > >> > > > > 2019,
> > >> > > > > > 21:47:
> > >> > > > > >
> > >> > > > > > > I can confirm that we are linking with two versions of
> omp,
> > >> I'm
> > >> > > > > > > gaining more clarity into this topic, but I have still
> > >> questions,
> > >> > > the
> > >> > > > > > > facts that I got so far are the following:
> > >> > > > > > >
> > >> > > > > > > * #1: We are linking with two versions of omp, intel's omp
> > >> and llvm
> > >> > > > > > > openmp when building with MKL enabled.
> > >> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP
> > >> (comes
> > >> > > with
> > >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with
> gcc)
> > >> (This
> > >> > > > > > > one is used on the PR proposed by Anton).
> > >> > > > > > >
> > >> > > > > > > Questions:
> > >> > > > > > >
> > >> > > > > > >  * #1 Is it ok to have two versions of openmp linked at
> the
> > >> same
> > >> > > time?
> > >> > > > > > >  * #2 Which implementation of OMP gives the best
> > >> performance?  (See
> > >> > > > > > > total training time of my measurement for a partial
> answer)
> > >> > > > > > >  * #3 Should we have a build flag so we can choose the OMP
> > >> version
> > >> > > at
> > >> > > > > > > runtime?
> > >> > > > > > >  * #4 Which Compiler and build flags did Chris use to get
> 10x
> > >> > > slowdown?
> > >> > > > > > >  * #5 @Stas: is there a script to replicate your
> benchmarks
> > >> > > easily? If
> > >> > > > > > > so could you provide a link?  I think we would need to
> > >> reproduce
> > >> > > your
> > >> > > > > > > benchmarks and verify which versions are being linked.
> It's
> > >> > > possible
> > >> > > > > > > that while compiling with MKL intel's omp was pulled in
> > >> instead of
> > >> > > > > > > GNU OpenMP.
> > >> > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp?
> > >> Should we
> > >> > > > > > > update the subrepo regularly?
> > >> > > > > > >
> > >> > > > > > > My conclusion so far:
> > >> > > > > > >
> > >> > > > > > >  * #1 We should avoid linking two versions of omp if
> possible
> > >> and
> > >> > > > > > > allow users to choose one in the build as we do for BLAS.
> > >> > > > > > >  * #2 For performance reasons and more control vs
> different
> > >> > > compiler
> > >> > > > > > > versions seems it makes indeed sense to keep the LLVM
> OpenMP
> > >> > > version
> > >> > > > > > > in 3rdparty for now. So unless some more data is
> gathered, it
> > >> makes
> > >> > > > > > > sense not to remove it as of now.
> > >> > > > > > >  * #3 We should provide build options to choose which
> openmp
> > >> > > library
> > >> > > > > > > is to be used from the three options available, including
> > >> libgomp.
> > >> > > > > > >  * #4 Refining the build we could also enable OpenMP in
> mac
> > >> without
> > >> > > > > > > additional contortions (doesn't work as of today):
> > >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > >> > > > > > >  * #5 We should add different omp versions to our
> benchmarks
> > >> and
> > >> > > track
> > >> > > > > > > the performance, so this data is available for prescribing
> > >> the best
> > >> > > > > > > build options and for binary releases.
> > >> > > > > > >
> > >> > > > > > > This is also an interesting related gh issue posted in the
> > >> mkl-dnn
> > >> > > > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > I don't observe the order of magnitude divergence
> reported by
> > >> > > Chris in
> > >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training
> > >> finishes
> > >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs
> > >> libgomp.
> > >> > > > > > >
> > >> > > > > > > There's also differences in training time when using MKL
> and
> > >> the ,
> > >> > > > > > > it's actually a bit slower, I don't know if it's related
> to
> > >> OMP.
> > >> > > > > > >
> > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > >> > > > > > >
> > >> > > > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git
> > >>  branch
> > >> > > > > 'omp'
> > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > >> > > > > > > build/libmxnet.so |grep -i omp
> > >> > > > > > >         libgomp.so.1 =>
> /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > >> > > > > > > (0x00007fd99a51d000)
> > >> > > > > > >
> > >> > > > > > > time python train_mnist.py
> > >> > > > > > >
> > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00
> > >> samples/sec
> > >> > > > > > >  accuracy=1.000000
> > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999531
> > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53
> > >> samples/sec
> > >> > > > > > >  accuracy=1.000000
> > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999375
> > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999531
> > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > >> > > > > > > INFO:root:Epoch[19] Time cost=1.219
> > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
> > >> (0avgtext+0avgdata
> > >> > > > > > > 1146052maxresident)k
> > >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> > >> > > > > > >
> > >> > > > > > > Master, MKL ON:
> > >> > > > > > >
> > >> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification
> > >> [master]>
> > >> > > ldd
> > >> > > > > > > ../../build/libmxnet.so | grep -i omp
> > >> > > > > > >         libomp.so =>
> > >> > > > > > >
> > >> > >
> /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > >> > > > > > > (0x00007f05ba38f000)
> > >> > > > > > >         libiomp5.so =>
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > >
> > >> > >
> > >>
> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > >> > > > > > > (0x00007f05b09f4000)
> > >> > > > > > >
> > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999691
> > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999531
> > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999375
> > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > >> > > > > > > INFO:root:Epoch[19] Time cost=1.367
> > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> > >> (0avgtext+0avgdata
> > >> > > > > > > 1154348maxresident)k
> > >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > MKL OFF:
> > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]>
> grep -i
> > >> MKL
> > >> > > > > > > cmake_options.yml
> > >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL
> > >> found) IF
> > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL
> found)
> > >> IF
> > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > >> > > > > > > build/libmxnet.so |grep -i omp
> > >> > > > > > >         libomp.so =>
> > >> > > > > > >
> > >> > >
> /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > >> > > > > > > (0x00007fb720c54000)
> > >> > > > > > >
> > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02
> > >> samples/sec
> > >> > > > > > >  accuracy=1.000000
> > >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999375
> > >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999687
> > >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56
> > >> samples/sec
> > >> > > > > > >  accuracy=0.999844
> > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > >> > > > > > > INFO:root:Epoch[19] Time cost=1.259
> > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
> > >> (0avgtext+0avgdata
> > >> > > > > > > 1147008maxresident)k
> > >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > >> > > > > > >
> > >> > > > > > > Let me know what you think.
> > >> > > > > > >
> > >> > > > > > > Link to the original PR:
> > >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > >> > > > > > >
> > >> > > > > > > Thanks.
> > >> > > > > > >
> > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > >> > > > > > > <ke...@gmail.com> wrote:
> > >> > > > > > > >
> > >> > > > > > > > "if you’re linking in two then you’re doing something
> > >> wrong."
> > >> > > > > Correct,
> > >> > > > > > > > that's one thing I believe we've got consensus on.  So
> > >> let's call
> > >> > > > > that
> > >> > > > > > > out
> > >> > > > > > > > as a bug to be fixed.
> > >> > > > > > > >
> > >> > > > > > > > Let's move forward with some reproducible numbers and
> then
> > >> > > discuss
> > >> > > > > the
> > >> > > > > > > pros
> > >> > > > > > > > / cons of which particular OMP implementation we should
> use.
> > >> > > > > > > >
> > >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > >> > > > > > > pedro.larroy.lists@gmail.com>
> > >> > > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Chris
> > >> > > > > > > > >
> > >> > > > > > > > > I would ask you to have a bit of patience and help us
> > >> with your
> > >> > > > > > > > > experience in this matter. Nobody is ignoring
> anything, I
> > >> > > think we
> > >> > > > > are
> > >> > > > > > > > > individually gathering feedbacks and trying to
> understand
> > >> the
> > >> > > > > multiple
> > >> > > > > > > > > contributions done to this topic including yours,
> then go
> > >> step
> > >> > > by
> > >> > > > > > > > > step, understand what is going on and run experiments
> and
> > >> > > report
> > >> > > > > back
> > >> > > > > > > > > to the list or the corresponding github item. It was
> > >> suggested
> > >> > > by
> > >> > > > > > > > > Kellen to prepare some containers, this takes effort.
> > >> > > > > > > > >
> > >> > > > > > > > > Regarding your final comment, most of us also have
> many
> > >> other
> > >> > > > > things
> > >> > > > > > > > > to do and responsibilities even if our daytime jobs
> might
> > >> > > involve
> > >> > > > > > > > > MXNet in some form or another. I think that's part of
> the
> > >> > > privilege
> > >> > > > > > > > > and responsibility of working close with an open
> source
> > >> > > project and
> > >> > > > > > > > > the magic of collaboration across organizations. Let's
> > >> all be
> > >> > > > > patient
> > >> > > > > > > > > and take some time to understand and reason about this
> > >> topic
> > >> > > which
> > >> > > > > is
> > >> > > > > > > > > not simple. Since we decided to step back and gather
> more
> > >> data
> > >> > > > > let's
> > >> > > > > > > > > take time and do it properly.
> > >> > > > > > > > >
> > >> > > > > > > > > Personally I hope to find time to look again into this
> > >> issue
> > >> > > before
> > >> > > > > > > > > the end of the week.
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks.
> > >> > > > > > > > >
> > >> > > > > > > > > Pedro.
> > >> > > > > > > > >
> > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > >> > > > > cjolivier01@apache.org>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > if you’re linking in two then you’re doing something
> > >> wrong.
> > >> > > You
> > >> > > > > can
> > >> > > > > > > see
> > >> > > > > > > > > by
> > >> > > > > > > > > > my email yesterday that only one is linked in. This
> is
> > >> also
> > >> > > the
> > >> > > > > case
> > >> > > > > > > with
> > >> > > > > > > > > > the mkl version built by the Makefile — only the
> Intel
> > >> OMP
> > >> > > > > library is
> > >> > > > > > > > > used
> > >> > > > > > > > > > (no libgomp).
> > >> > > > > > > > > >
> > >> > > > > > > > > > That being said, Do you have clear evidence that
> using
> > >> Intel
> > >> > > OMP
> > >> > > > > is
> > >> > > > > > > both
> > >> > > > > > > > > > problematic and the situation isn’t fixable?  The
> > >> burden of
> > >> > > > > proof is
> > >> > > > > > > on
> > >> > > > > > > > > the
> > >> > > > > > > > > > ones requesting the change — it is not my
> > >> responsibility to
> > >> > > > > justify
> > >> > > > > > > the
> > >> > > > > > > > > > current state.  There must be something “terrible”
> and
> > >> > > unfixable
> > >> > > > > to
> > >> > > > > > > > > justify
> > >> > > > > > > > > > a change.  I have seen no proof of this in all this
> > >> time.
> > >> > > > > > > > > >
> > >> > > > > > > > > > On a side note, I mentioned a couple of things in my
> > >> email
> > >> > > > > yesterday
> > >> > > > > > > that
> > >> > > > > > > > > > still are not being responded to (they were also
> > >> ignored in
> > >> > > the
> > >> > > > > last
> > >> > > > > > > > > > incarnation of this “discussion” — I have much
> > >> experience in
> > >> > > this
> > >> > > > > > > matter
> > >> > > > > > > > > to
> > >> > > > > > > > > > assume “discussion” is a waste of my time, seeing
> and I
> > >> am
> > >> > > not
> > >> > > > > paid
> > >> > > > > > > to
> > >> > > > > > > > > > “work on” mxnet like y’all are).
> > >> > > > > > > > > >
> > >> > > > > > > > > > -C
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > >> > > > > > > > > > kellen.sunderland@gmail.com> wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > I've also quite often seen two versions of OpenMP
> > >> linked.
> > >> > > I
> > >> > > > > think
> > >> > > > > > > we
> > >> > > > > > > > > can
> > >> > > > > > > > > > > all agree we probably want to avoid linking in two
> > >> > > libraries
> > >> > > > > that
> > >> > > > > > > do
> > >> > > > > > > > > > > effectively the same thing.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > The performance questions should be fairly
> straight
> > >> > > forward to
> > >> > > > > > > > > demonstrate
> > >> > > > > > > > > > > right?  Could we just collaborate on a few minimal
> > >> > > Dockerfiles
> > >> > > > > that
> > >> > > > > > > > > show
> > >> > > > > > > > > > > (or don't show) Intel OpenMP performance speedups
> > >> with the
> > >> > > > > > > workloads
> > >> > > > > > > > > Chris
> > >> > > > > > > > > > > is referencing?
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov,
> Stanislav <
> > >> > > > > > > > > > > stanislav.tsukrov@gmail.com> wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Hi, Chris!
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Stas here - I've gathered that performance data.
> > >> > > > > > > > > > > > Sure thing, I can be wrong, but please
> elaborate a
> > >> bit on
> > >> > > > > what
> > >> > > > > > > we are
> > >> > > > > > > > > > > > missing.
> > >> > > > > > > > > > > > Be assured, intentional misdirection was never a
> > >> case.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Thanks a lot for being constructive.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well,
> > >> since it
> > >> > > > > tends
> > >> > > > > > > to
> > >> > > > > > > > > pull
> > >> > > > > > > > > > > in
> > >> > > > > > > > > > > > omp, depending which one is linked in).
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > We never ever considered turning MKL off. We
> are on
> > >> the
> > >> > > same
> > >> > > > > page
> > >> > > > > > > > > here -
> > >> > > > > > > > > > > > MKL is crucial for the performance.
> > >> > > > > > > > > > > > Why should we? There's a GOMP-linked version of
> MKL,
> > >> > > that we
> > >> > > > > can
> > >> > > > > > > use.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > What we did - we measured, if using compilers
> > >> default
> > >> > > OpenMP
> > >> > > > > > > > > > > > implementation instead of referenced source code
> > >> > > > > distribution of
> > >> > > > > > > > > OpenMP
> > >> > > > > > > > > > > > makes anything slower.
> > >> > > > > > > > > > > > We have found the impact to be hardly
> measurable.
> > >> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on
> our
> > >> > > > > benchmarks,
> > >> > > > > > > most
> > >> > > > > > > > > of
> > >> > > > > > > > > > > the
> > >> > > > > > > > > > > > time less than that.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > We just suggest to simplify the build of mxnet,
> by
> > >> > > removing
> > >> > > > > the
> > >> > > > > > > > > > > > unnecessary dependency.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > During that we discovered for example the
> following
> > >> > > amazing
> > >> > > > > > > issue:
> > >> > > > > > > > > > > >
> > >> https://github.com/apache/incubator-mxnet/issues/14087
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Best Regards
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Stas
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" <
> > >> > > cjolivier01@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     I am very reluctant to feed the trolls
> again,
> > >> and
> > >> > > this
> > >> > > > > will
> > >> > > > > > > be
> > >> > > > > > > > > teh
> > >> > > > > > > > > > > last
> > >> > > > > > > > > > > >     time I address Pedro or Anton on the
> subject,
> > >> but
> > >> > > since I
> > >> > > > > > > think
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > numbers
> > >> > > > > > > > > > > >     being presented are incorrect (either by te
> > >> builders
> > >> > > not
> > >> > > > > > > really
> > >> > > > > > > > > > > >     understanding what they are building, or
> > >> possibly
> > >> > > > > intentional
> > >> > > > > > > > > > > > misdirection):
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     Turning Intel OMP on and off (and MKL as
> well,
> > >> since
> > >> > > it
> > >> > > > > > > tends to
> > >> > > > > > > > > pull
> > >> > > > > > > > > > > > in
> > >> > > > > > > > > > > >     omp, depending which one is linked in).
> > >> > > > > > > > > > > >     There is a HUGE difference.  This is
> consistent
> > >> with
> > >> > > my
> > >> > > > > > > > > experience
> > >> > > > > > > > > > > > before
> > >> > > > > > > > > > > >     when it was added.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     default mnist:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     python
> > >> ../example/image-classification/train_mnist.py
> > >> > > > > > > > > > > >     INFO:root:start with arguments
> > >> > > Namespace(add_stn=False,
> > >> > > > > > > > > > > batch_size=64,
> > >> > > > > > > > > > > >     disp_batches=100, dtype='float32',
> > >> gc_threshold=0.5,
> > >> > > > > > > > > gc_type='none',
> > >> > > > > > > > > > > >     gpus=None, image_shape='1, 28, 28',
> > >> > > > > initializer='default',
> > >> > > > > > > > > > > >     kv_store='device', load_epoch=None, loss='',
> > >> lr=0.05,
> > >> > > > > > > > > lr_factor=0.1,
> > >> > > > > > > > > > > >     lr_step_epochs='10', macrobatch_size=0,
> > >> > > > > model_prefix=None,
> > >> > > > > > > > > mom=0.9,
> > >> > > > > > > > > > > >     monitor=0, network='mlp', num_classes=10,
> > >> > > num_epochs=20,
> > >> > > > > > > > > > > >     num_examples=60000, num_layers=None,
> > >> optimizer='sgd',
> > >> > > > > > > > > > > >     profile_server_suffix='',
> > >> profile_worker_suffix='',
> > >> > > > > > > > > save_period=1,
> > >> > > > > > > > > > > >     test_io=0, top_k=0, warmup_epochs=5,
> > >> > > > > > > warmup_strategy='linear',
> > >> > > > > > > > > > > > wd=0.0001)
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     INTEL OMP:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     ldd libmxnet.so | grep omp
> > >> > > > > > > > > > > >             libomp.so =>
> > >> > > > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > >
> > >> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > >> > > > > > > > > > > >     (0x00007f978fde7000)
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     :root:Epoch[0] Batch [0-100]        Speed:
> > >> 31548.09
> > >> > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.780012
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]
> Speed:
> > >> > > 16073.21
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.920469
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]
> Speed:
> > >> > > 19075.91
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.928281
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]
> Speed:
> > >> > > 23211.36
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.942813
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]
> Speed:
> > >> > > 22139.79
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.938750
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]
> Speed:
> > >> > > 23225.52
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.946562
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]
> Speed:
> > >> > > 19547.41
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.953281
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]
> Speed:
> > >> > > 24111.73
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.951562
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]
> Speed:
> > >> > > 13959.88
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.957500
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Time cost=3.806
> > >> > > > > > > > > > > >     INFO:root:Epoch[0]
> Validation-accuracy=0.962580
> > >> > > > > > > > > > > >     INFO:root:Epoch[1] Batch [0-100]
> Speed:
> > >> > > 24560.21
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.968131
> > >> > > > > > > > > > > >     INFO:root:Epoch[1] Batch [100-200]
> Speed:
> > >> > > 23457.03
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.966250
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     LIBGOMP:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     ldd libmxnet.so | grep omp
> > >> > > > > > > > > > > >             libgomp.so.1 =>
> > >> > > > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > >> > > > > > > > > > > >     (0x00007f25c25dd000)
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [0-100]
> Speed:
> > >> > > 1731.01
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.782488
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]
> Speed:
> > >> > > 3551.32
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.907813
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]
> Speed:
> > >> > > 1991.00
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.927188
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]
> Speed:
> > >> > > 2175.45
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.937969
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]
> Speed:
> > >> > > 1644.95
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.942187
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]
> Speed:
> > >> > > 6444.58
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.950156
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]
> Speed:
> > >> > > 7842.16
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.947969
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]
> Speed:
> > >> > > 9412.07
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >      accuracy=0.953750
> > >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]
> Speed:
> > >> > > 12707.58
> > >> > > > > > > > > samples/sec
> > >> > > > > > > > > > > >     accuracy=0.953125
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     That being said, there's other issued beyond
> > >> speed.
> > >> > > The
> > >> > > > > > > DEFAULT
> > >> > > > > > > > > > > build
> > >> > > > > > > > > > > > from
> > >> > > > > > > > > > > >     makefile (not CMake) uses Intel OMP mkl (I
> > >> showed
> > >> > > > > before) and
> > >> > > > > > > > > > > > mysteriously
> > >> > > > > > > > > > > >     it has no issues?  This seems highly
> suspicious.
> > >> > > All I
> > >> > > > > see
> > >> > > > > > > is a
> > >> > > > > > > > > lot
> > >> > > > > > > > > > > of
> > >> > > > > > > > > > > >     hand-waving and conjecture and pointing to
> > >> > > StackOverflow
> > >> > > > > > > posts
> > >> > > > > > > > > made
> > >> > > > > > > > > > > by
> > >> > > > > > > > > > > >     people who may be of questionable pedigree
> to
> > >> begin
> > >> > > with.
> > >> > > > > > > This
> > >> > > > > > > > > > > smells
> > >> > > > > > > > > > > > of a
> > >> > > > > > > > > > > >     Pedro-ego-fight rather than one of purely
> > >> technical
> > >> > > > > merit.
> > >> > > > > > > > > Also, if
> > >> > > > > > > > > > > > one
> > >> > > > > > > > > > > >     knows how OMP works,  they would be very
> > >> suspicious
> > >> > > of
> > >> > > > > the
> > >> > > > > > > > > > > > "intermittent
> > >> > > > > > > > > > > >     hangs" claim -- that's probably just broken
> race
> > >> > > > > conditions
> > >> > > > > > > > > elsewhere
> > >> > > > > > > > > > > > until
> > >> > > > > > > > > > > >     proven differently.  It'd tend freeze on the
> > >> first
> > >> > > use if
> > >> > > > > > > > > something
> > >> > > > > > > > > > > is
> > >> > > > > > > > > > > >     wrong (try using libgomp after a fork and
> see),
> > >> since
> > >> > > > > worker
> > >> > > > > > > > > threads"
> > >> > > > > > > > > > > >     wouldn't be assigned/joined properly.
> IntelOMP
> > >> is
> > >> > > > > faster,
> > >> > > > > > > but
> > >> > > > > > > > > also
> > >> > > > > > > > > > > has
> > >> > > > > > > > > > > >     other advantages, such as allowing OMP
> after a
> > >> fork.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     I actually addressed a lot of issues and
> ask for
> > >> > > > > > > clarification
> > >> > > > > > > > > in the
> > >> > > > > > > > > > > >     original PR's way back when, but they're all
> > >> just
> > >> > > > > ignored.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >     -Chris
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > >
> > >> > > > >
> > >> > >
> > >>
> > >
>
>

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
Thanks for digging that out, Kellen. That's good info, so maybe it would
be good to rework the fix with the information you provided and remove
the pthread_atfork handlers (see the sketch below).
Do you think setting the device would avoid the problem seen in the
backtrace you provided? Specifically here:
https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
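
For anyone following along, this is roughly the pattern under discussion:
a pthread_atfork child handler that clamps OMP in the forked process. This
is only a minimal, hypothetical sketch (the hook names and the
single-thread policy are my assumptions, not the actual initialize.cc
code):

    #include <pthread.h>
    #include <omp.h>

    // Stand-ins for the engine's real fork hooks.
    static void prepare_fork()      { /* quiesce worker threads before fork() */ }
    static void parent_after_fork() { /* restart workers in the parent */ }
    static void child_after_fork() {
      // The contested part: force OMP down to one thread in the child so a
      // runtime that cannot survive fork() (e.g. libgomp) does not dead-lock
      // on the thread pool inherited from the parent.
      omp_set_num_threads(1);
    }

    // Registered once at library load time.
    static const int fork_handlers_registered =
        pthread_atfork(prepare_fork, parent_after_fork, child_after_fork);

Removing the handlers would mean relying on the OMP runtime itself to
re-create its thread pool after fork, which (per Chris's earlier mail) the
Intel/LLVM runtime does and libgomp does not.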

On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
<ke...@gmail.com> wrote:
>
> I remember at the time we also had a read-through of this blog post, but to
> us the code looked like it was following the advice:
> https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
>
> On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> kellen.sunderland@gmail.com> wrote:
>
> > I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> > believe the stacks for the hang are here:
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and
> > the trick was we could only debug it up to the point that we hit:
> >
> > #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> > futex_word=0x7fec60843758)
> > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
> >     at ../sysdeps/nptl/futex-internal.h:135
> > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > init_routine=0x7fec605f38f0)
> >     at pthread_once.c:105
> > ...
> > #6  0x00007fec6061c577 in cudaSetDevice () from
> > /usr/local/cuda/lib64/libcudart.so.9.0
> >
> > because the code in libcudart is obviously closed source we couldn't dig
> > into what threading work was going on when we called cudaSetDevice.
> >
> > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <pe...@gmail.com>
> > wrote:
> >
> >> If you check initialize.cc we seem to be explicitly disabling that
> >> behaviour in pthread_at_fork which seems to cause thread contention
> >> during multiprocessing. Why do we need this major advantage for the
> >> library if that's the case?
> >>
> >> Related PRs:
> >>
> >> https://github.com/apache/incubator-mxnet/pull/10820
> >> https://github.com/apache/incubator-mxnet/issues/14396
> >>
> >> The original code was authored in this PR:
> >>
> >> https://github.com/apache/incubator-mxnet/pull/8677
> >>
> >> I actually remember this fix, it was done during a release as the cuda
> >> runtime was forking and the engine was being re-entered. If that
> >> situation is now happening anymore it might not be needed any longer.
> >> I don't think we know the cause why there was a fork inside cuda, so
> >> the code has grown around a fix for an issue which its root cause was
> >> not understood, and side effects which this fix caused afterwards.
> >>
> >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> >> the link above, no libgomp.
> >>
> >> I didn't try the Make build.
> >>
> >> I would refactor the code linked above and stop using pthread_at_fork,
> >> since OMP assumes it won't be initialized twice, but needs to be very
> >> well tested to make sure it doesn't cause bugs or affect the fixes
> >> done on the linked PRs above.
> >>
> >> Pedro.
> >>
> >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cj...@gmail.com>
> >> wrote:
> >> >
> >> > one major advantage of intel/llvm omp is that it spawns a new thread
> >> pool
> >> > after fork if a thread pool was already created. this is so that omp
> >> can be
> >> > used in the forked processes. libgomp doesn’t do this so it’ll just
> >> lock up
> >> > if you try to do omp in the forked process.
> >> >
> >> > is your build linking libgomp as well?
> >> >
> >> > standard mkl build (from Makefile) uses same omp library. are there
> >> > problems with that build?
> >> >
> >> > what changes need to be made to make the assertion not fire?
> >> >
> >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> >> pedro.larroy.lists@gmail.com>
> >> > wrote:
> >> >
> >> > > There's an assertion which is easily reproducible, and also there's a
> >> > > crash including core dump, the latter is not easy to reproduce for me
> >> > > in different environments. I have also seen mxnet getting stuck
> >> > > without progressing with this build configuration and using no CPU at
> >> > > all when running unit tests.
> >> > >
> >> > > In my view, the root cause of the assertion is that we are re-entering
> >> > > OMP initialization when spawning threads on the following code through
> >> > > pthread_at_fork
> >> > >
> >> > >
> >> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> >> > >
> >> > > This causes double initialization of the OMP engine, including the
> >> > > assertion which you are asking about,  and I suspect some additional
> >> > > overhead. That's the shady forking part you are asking for.
> >> > >
> >> > > A question for you: What is the cause of runtime differences between
> >> > > OMP runtimes? Shouldn't the implementation overhead diminish as
> >> > > threads run longer?
> >> > >
> >> > > Pedro.
> >> > >
> >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cj...@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > What’s the reason for the assertion failure? btw classifying an
> >> assertion
> >> > > > failure a “crash” is debatable. As I stated in the original issue a
> >> long
> >> > > > time ago, it’s possible something shady is being done with when
> >> forking
> >> > > > that should be fixed.  The assertion should be root caused.
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> >> > > pedro.larroy.lists@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Added a dockerfile, and reports of a crash in my local machine
> >> when
> >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as
> >> well.
> >> > > > > I couldn't reproduce the crash on my EC2 machine:
> >> > > > > Added the backtrace of the crash as well.
> >> > > > >
> >> > > > > https://github.com/apache/incubator-mxnet/issues/10856
> >> > > > >
> >> > > > > Dockerfile here:
> >> > > > >
> >> > > > > https://github.com/larroy/mxnet_omp
> >> > > > >
> >> > > > > Kind regards.
> >> > > > >
> >> > > > > Pedro.
> >> > > > >
> >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> >> > > marco.g.abreu@gmail.com>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > As already proposed, I think the easiest way to get a common
> >> > > > > understanding
> >> > > > > > is if we start with a few docker containers. Pedro, would it be
> >> > > possible
> >> > > > > > for you to wrap your benchmarks into a few containers that will
> >> > > produce
> >> > > > > > your shown results? That way, we can avoid possible
> >> > > misunderstandings and
> >> > > > > > also pinpoint the exact parts where people disagree or
> >> misunderstood
> >> > > each
> >> > > > > > other.
> >> > > > > >
> >> > > > > > -Marco
> >> > > > > >
> >> > > > > > Pedro Larroy <pe...@gmail.com> schrieb am Do.,
> >> 20. Juni
> >> > > > > 2019,
> >> > > > > > 21:47:
> >> > > > > >
> >> > > > > > > I can confirm that we are linking with two versions of omp,
> >> I'm
> >> > > > > > > gaining more clarity into this topic, but I have still
> >> questions,
> >> > > the
> >> > > > > > > facts that I got so far are the folllowing:
> >> > > > > > >
> >> > > > > > > * #1: We are linking with two versions of omp, intel's omp
> >> and llvm
> >> > > > > > > openmp when building with MKL enabled.
> >> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP
> >> (comes
> >> > > with
> >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc)
> >> (This
> >> > > > > > > one is used on the PR proposed by Anton).
> >> > > > > > >
> >> > > > > > > Questions:
> >> > > > > > >
> >> > > > > > >  * #1 Is it ok to have two versions of openmp linked at the
> >> same
> >> > > time?
> >> > > > > > >  * #2 Which implementation of OMP gives the best
> >> performance?  (See
> >> > > > > > > total training time of my measurement for a partial answer)
> >> > > > > > >  * #3 Should we have a build flag so we can choose the OMP
> >> version
> >> > > at
> >> > > > > > > runtime?
> >> > > > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
> >> > > slowdown?
> >> > > > > > >  * #5 @Stas: is there a script to replicate your benchmarks
> >> > > easily? If
> >> > > > > > > so could you provide a link?  I think we would need to
> >> reproduce
> >> > > your
> >> > > > > > > benchmarks and verify which versions are being linked. It's
> >> > > possible
> >> > > > > > > that while compiling with MKL intel's omp was pulled in
> >> instead of
> >> > > > > > > GNU OpenMP.
> >> > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp?
> >> Should we
> >> > > > > > > update the subrepo regularly?
> >> > > > > > >
> >> > > > > > > My conclusion so far:
> >> > > > > > >
> >> > > > > > >  * #1 We should avoid linking two versions of omp if possible
> >> and
> >> > > > > > > allow users to choose one in the build as we do for BLAS.
> >> > > > > > >  * #2 For performance reasons and more control vs different
> >> > > compiler
> >> > > > > > > versions seems it makes indeed sense to keep the LLVM OpenMP
> >> > > version
> >> > > > > > > in 3rdparty for now. So unless some more data is gathered, it
> >> makes
> >> > > > > > > sense not to remove it as of now.
> >> > > > > > >  * #3 We should provide build options to choose which openmp
> >> > > library
> >> > > > > > > is to be used from the three options available, including
> >> libgomp.
> >> > > > > > >  * #4 Refining the build we could also enable OpenMP in mac
> >> without
> >> > > > > > > additional contortions (doesn't work as of today):
> >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> >> > > > > > >  * #5 We should add different omp versions to our benchmarks
> >> and
> >> > > track
> >> > > > > > > the performance, so this data is available for prescribing
> >> the best
> >> > > > > > > build options and for binary releases.
> >> > > > > > >
> >> > > > > > > This is also an interesting related gh issue posted in the
> >> mkl-dnn
> >> > > > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > I don't observe the order of magnitude divergence reported by
> >> > > Chris in
> >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training
> >> finishes
> >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs
> >> libgomp.
> >> > > > > > >
> >> > > > > > > There's also differences in training time when using MKL and
> >> the ,
> >> > > > > > > it's actually a bit slower, I don't know if it's related to
> >> OMP.
> >> > > > > > >
> >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> >> > > > > > >
> >> > > > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git
> >>  branch
> >> > > > > 'omp'
> >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> >> > > > > > > build/libmxnet.so |grep -i omp
> >> > > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> >> > > > > > > (0x00007fd99a51d000)
> >> > > > > > >
> >> > > > > > > time python train_mnist.py
> >> > > > > > >
> >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00
> >> samples/sec
> >> > > > > > >  accuracy=1.000000
> >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69
> >> samples/sec
> >> > > > > > >  accuracy=0.999531
> >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53
> >> samples/sec
> >> > > > > > >  accuracy=1.000000
> >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63
> >> samples/sec
> >> > > > > > >  accuracy=0.999375
> >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98
> >> samples/sec
> >> > > > > > >  accuracy=0.999531
> >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> >> > > > > > > INFO:root:Epoch[19] Time cost=1.219
> >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
> >> (0avgtext+0avgdata
> >> > > > > > > 1146052maxresident)k
> >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> >> > > > > > >
> >> > > > > > > Master, MKL ON:
> >> > > > > > >
> >> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification
> >> [master]>
> >> > > ldd
> >> > > > > > > ../../build/libmxnet.so | grep -i omp
> >> > > > > > >         libomp.so =>
> >> > > > > > >
> >> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> >> > > > > > > (0x00007f05ba38f000)
> >> > > > > > >         libiomp5.so =>
> >> > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> >> > > > > > > (0x00007f05b09f4000)
> >> > > > > > >
> >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63
> >> samples/sec
> >> > > > > > >  accuracy=0.999691
> >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96
> >> samples/sec
> >> > > > > > >  accuracy=0.999531
> >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47
> >> samples/sec
> >> > > > > > >  accuracy=0.999375
> >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> >> > > > > > > INFO:root:Epoch[19] Time cost=1.367
> >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> >> (0avgtext+0avgdata
> >> > > > > > > 1154348maxresident)k
> >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > MKL OFF:
> >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i
> >> MKL
> >> > > > > > > cmake_options.yml
> >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL
> >> found) IF
> >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found)
> >> IF
> >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> >> > > > > > > build/libmxnet.so |grep -i omp
> >> > > > > > >         libomp.so =>
> >> > > > > > >
> >> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> >> > > > > > > (0x00007fb720c54000)
> >> > > > > > >
> >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> >> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02
> >> samples/sec
> >> > > > > > >  accuracy=1.000000
> >> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46
> >> samples/sec
> >> > > > > > >  accuracy=0.999375
> >> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55
> >> samples/sec
> >> > > > > > >  accuracy=0.999687
> >> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56
> >> samples/sec
> >> > > > > > >  accuracy=0.999844
> >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> >> > > > > > > INFO:root:Epoch[19] Time cost=1.259
> >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
> >> (0avgtext+0avgdata
> >> > > > > > > 1147008maxresident)k
> >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> >> > > > > > >
> >> > > > > > > Let me know what you think.
> >> > > > > > >
> >> > > > > > > Link to the original PR:
> >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> >> > > > > > >
> >> > > > > > > Thanks.
> >> > > > > > >
> >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> >> > > > > > > <ke...@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > "if you’re linking in two then you’re doing something
> >> wrong."
> >> > > > > Correct,
> >> > > > > > > > that's one thing I believe we've got consensus on.  So
> >> let's call
> >> > > > > that
> >> > > > > > > out
> >> > > > > > > > as a bug to be fixed.
> >> > > > > > > >
> >> > > > > > > > Let's move forward with some reproducible numbers and then
> >> > > discuss
> >> > > > > the
> >> > > > > > > pros
> >> > > > > > > > / cons of which particular OMP implementation we should use.
> >> > > > > > > >
> >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> >> > > > > > > pedro.larroy.lists@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Chris
> >> > > > > > > > >
> >> > > > > > > > > I would ask you to have a bit of patience and help us
> >> with your
> >> > > > > > > > > experience in this matter. Nobody is ignoring anything, I
> >> > > think we
> >> > > > > are
> >> > > > > > > > > individually gathering feedbacks and trying to understand
> >> the
> >> > > > > multiple
> >> > > > > > > > > contributions done to this topic including yours, then go
> >> step
> >> > > by
> >> > > > > > > > > step, understand what is going on and run experiments and
> >> > > report
> >> > > > > back
> >> > > > > > > > > to the list or the corresponding github item. It was
> >> suggested
> >> > > by
> >> > > > > > > > > Kellen to prepare some containers, this takes effort.
> >> > > > > > > > >
> >> > > > > > > > > Regarding your final comment, most of us also have many
> >> other
> >> > > > > things
> >> > > > > > > > > to do and responsibilities even if our daytime jobs might
> >> > > involve
> >> > > > > > > > > MXNet in some form or another. I think that's part of the
> >> > > privilege
> >> > > > > > > > > and responsibility of working close with an open source
> >> > > project and
> >> > > > > > > > > the magic of collaboration across organizations. Let's
> >> all be
> >> > > > > patient
> >> > > > > > > > > and take some time to understand and reason about this
> >> topic
> >> > > which
> >> > > > > is
> >> > > > > > > > > not simple. Since we decided to step back and gather more
> >> data
> >> > > > > let's
> >> > > > > > > > > take time and do it properly.
> >> > > > > > > > >
> >> > > > > > > > > Personally I hope to find time to look again into this
> >> issue
> >> > > before
> >> > > > > > > > > the end of the week.
> >> > > > > > > > >
> >> > > > > > > > > Thanks.
> >> > > > > > > > >
> >> > > > > > > > > Pedro.
> >> > > > > > > > >
> >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> >> > > > > cjolivier01@apache.org>
> >> > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > if you’re linking in two then you’re doing something
> >> wrong.
> >> > > You
> >> > > > > can
> >> > > > > > > see
> >> > > > > > > > > by
> >> > > > > > > > > > my email yesterday that only one is linked in. This is
> >> also
> >> > > the
> >> > > > > case
> >> > > > > > > with
> >> > > > > > > > > > the mkl version built by the Makefile — only the Intel
> >> OMP
> >> > > > > library is
> >> > > > > > > > > used
> >> > > > > > > > > > (no libgomp).
> >> > > > > > > > > >
> >> > > > > > > > > > That being said, Do you have clear evidence that using
> >> Intel
> >> > > OMP
> >> > > > > is
> >> > > > > > > both
> >> > > > > > > > > > problematic and the situation isn’t fixable?  The
> >> burden of
> >> > > > > proof is
> >> > > > > > > on
> >> > > > > > > > > the
> >> > > > > > > > > > ones requesting the change — it is not my
> >> responsibility to
> >> > > > > justify
> >> > > > > > > the
> >> > > > > > > > > > current state.  There must be something “terrible” and
> >> > > unfixable
> >> > > > > to
> >> > > > > > > > > justify
> >> > > > > > > > > > a change.  I have seen no proof of this in all this
> >> time.
> >> > > > > > > > > >
> >> > > > > > > > > > On a side note, I mentioned a couple of things in my
> >> email
> >> > > > > yesterday
> >> > > > > > > that
> >> > > > > > > > > > still are not being responded to (they were also
> >> ignored in
> >> > > the
> >> > > > > last
> >> > > > > > > > > > incarnation of this “discussion” — I have much
> >> experience in
> >> > > this
> >> > > > > > > matter
> >> > > > > > > > > to
> >> > > > > > > > > > assume “discussion” is a waste of my time, seeing and I
> >> am
> >> > > not
> >> > > > > paid
> >> > > > > > > to
> >> > > > > > > > > > “work on” mxnet like y’all are).
> >> > > > > > > > > >
> >> > > > > > > > > > -C
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> >> > > > > > > > > > kellen.sunderland@gmail.com> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > I've also quite often seen two versions of OpenMP
> >> linked.
> >> > > I
> >> > > > > think
> >> > > > > > > we
> >> > > > > > > > > can
> >> > > > > > > > > > > all agree we probably want to avoid linking in two
> >> > > libraries
> >> > > > > that
> >> > > > > > > do
> >> > > > > > > > > > > effectively the same thing.
> >> > > > > > > > > > >
> >> > > > > > > > > > > The performance questions should be fairly straight
> >> > > forward to
> >> > > > > > > > > demonstrate
> >> > > > > > > > > > > right?  Could we just collaborate on a few minimal
> >> > > Dockerfiles
> >> > > > > that
> >> > > > > > > > > show
> >> > > > > > > > > > > (or don't show) Intel OpenMP performance speedups
> >> with the
> >> > > > > > > workloads
> >> > > > > > > > > Chris
> >> > > > > > > > > > > is referencing?
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> >> > > > > > > > > > > stanislav.tsukrov@gmail.com> wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi, Chris!
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Stas here - I've gathered that performance data.
> >> > > > > > > > > > > > Sure thing, I can be wrong, but please elaborate a
> >> bit on
> >> > > > > what
> >> > > > > > > we are
> >> > > > > > > > > > > > missing.
> >> > > > > > > > > > > > Be assured, intentional misdirection was never a
> >> case.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thanks a lot for being constructive.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well,
> >> since it
> >> > > > > tends
> >> > > > > > > to
> >> > > > > > > > > pull
> >> > > > > > > > > > > in
> >> > > > > > > > > > > > omp, depending which one is linked in).
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > We never ever considered turning MKL off. We are on
> >> the
> >> > > same
> >> > > > > page
> >> > > > > > > > > here -
> >> > > > > > > > > > > > MKL is crucial for the performance.
> >> > > > > > > > > > > > Why should we? There's a GOMP-linked version of MKL,
> >> > > that we
> >> > > > > can
> >> > > > > > > use.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > What we did - we measured, if using compilers
> >> default
> >> > > OpenMP
> >> > > > > > > > > > > > implementation instead of referenced source code
> >> > > > > distribution of
> >> > > > > > > > > OpenMP
> >> > > > > > > > > > > > makes anything slower.
> >> > > > > > > > > > > > We have found the impact to be hardly measurable.
> >> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on our
> >> > > > > benchmarks,
> >> > > > > > > most
> >> > > > > > > > > of
> >> > > > > > > > > > > the
> >> > > > > > > > > > > > time less than that.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > We just suggest to simplify the build of mxnet, by
> >> > > removing
> >> > > > > the
> >> > > > > > > > > > > > unnecessary dependency.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > During that we discovered for example the following
> >> > > amazing
> >> > > > > > > issue:
> >> > > > > > > > > > > >
> >> https://github.com/apache/incubator-mxnet/issues/14087
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Best Regards
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Stas
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" <
> >> > > cjolivier01@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     I am very reluctant to feed the trolls again,
> >> and
> >> > > this
> >> > > > > will
> >> > > > > > > be
> >> > > > > > > > > teh
> >> > > > > > > > > > > last
> >> > > > > > > > > > > >     time I address Pedro or Anton on the subject,
> >> but
> >> > > since I
> >> > > > > > > think
> >> > > > > > > > > the
> >> > > > > > > > > > > > numbers
> >> > > > > > > > > > > >     being presented are incorrect (either by te
> >> builders
> >> > > not
> >> > > > > > > really
> >> > > > > > > > > > > >     understanding what they are building, or
> >> possibly
> >> > > > > intentional
> >> > > > > > > > > > > > misdirection):
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     Turning Intel OMP on and off (and MKL as well,
> >> since
> >> > > it
> >> > > > > > > tends to
> >> > > > > > > > > pull
> >> > > > > > > > > > > > in
> >> > > > > > > > > > > >     omp, depending which one is linked in).
> >> > > > > > > > > > > >     There is a HUGE difference.  This is consistent
> >> with
> >> > > my
> >> > > > > > > > > experience
> >> > > > > > > > > > > > before
> >> > > > > > > > > > > >     when it was added.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     default mnist:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     python
> >> ../example/image-classification/train_mnist.py
> >> > > > > > > > > > > >     INFO:root:start with arguments
> >> > > Namespace(add_stn=False,
> >> > > > > > > > > > > batch_size=64,
> >> > > > > > > > > > > >     disp_batches=100, dtype='float32',
> >> gc_threshold=0.5,
> >> > > > > > > > > gc_type='none',
> >> > > > > > > > > > > >     gpus=None, image_shape='1, 28, 28',
> >> > > > > initializer='default',
> >> > > > > > > > > > > >     kv_store='device', load_epoch=None, loss='',
> >> lr=0.05,
> >> > > > > > > > > lr_factor=0.1,
> >> > > > > > > > > > > >     lr_step_epochs='10', macrobatch_size=0,
> >> > > > > model_prefix=None,
> >> > > > > > > > > mom=0.9,
> >> > > > > > > > > > > >     monitor=0, network='mlp', num_classes=10,
> >> > > num_epochs=20,
> >> > > > > > > > > > > >     num_examples=60000, num_layers=None,
> >> optimizer='sgd',
> >> > > > > > > > > > > >     profile_server_suffix='',
> >> profile_worker_suffix='',
> >> > > > > > > > > save_period=1,
> >> > > > > > > > > > > >     test_io=0, top_k=0, warmup_epochs=5,
> >> > > > > > > warmup_strategy='linear',
> >> > > > > > > > > > > > wd=0.0001)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     INTEL OMP:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     ldd libmxnet.so | grep omp
> >> > > > > > > > > > > >             libomp.so =>
> >> > > > > > > > > > > >
> >> > > > > > > > >
> >> > > > >
> >> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> >> > > > > > > > > > > >     (0x00007f978fde7000)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     :root:Epoch[0] Batch [0-100]        Speed:
> >> 31548.09
> >> > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.780012
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
> >> > > 16073.21
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.920469
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
> >> > > 19075.91
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.928281
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
> >> > > 23211.36
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.942813
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
> >> > > 22139.79
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.938750
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
> >> > > 23225.52
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.946562
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
> >> > > 19547.41
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.953281
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
> >> > > 24111.73
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.951562
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
> >> > > 13959.88
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.957500
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Time cost=3.806
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> >> > > > > > > > > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed:
> >> > > 24560.21
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.968131
> >> > > > > > > > > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed:
> >> > > 23457.03
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.966250
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     LIBGOMP:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     ldd libmxnet.so | grep omp
> >> > > > > > > > > > > >             libgomp.so.1 =>
> >> > > > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> >> > > > > > > > > > > >     (0x00007f25c25dd000)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed:
> >> > > 1731.01
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.782488
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
> >> > > 3551.32
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.907813
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
> >> > > 1991.00
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.927188
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
> >> > > 2175.45
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.937969
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
> >> > > 1644.95
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.942187
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
> >> > > 6444.58
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.950156
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
> >> > > 7842.16
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.947969
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
> >> > > 9412.07
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >      accuracy=0.953750
> >> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
> >> > > 12707.58
> >> > > > > > > > > samples/sec
> >> > > > > > > > > > > >     accuracy=0.953125
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     That being said, there's other issued beyond
> >> speed.
> >> > > The
> >> > > > > > > DEFAULT
> >> > > > > > > > > > > build
> >> > > > > > > > > > > > from
> >> > > > > > > > > > > >     makefile (not CMake) uses Intel OMP mkl (I
> >> showed
> >> > > > > before) and
> >> > > > > > > > > > > > mysteriously
> >> > > > > > > > > > > >     it has no issues?  This seems highly suspicious.
> >> > > All I
> >> > > > > see
> >> > > > > > > is a
> >> > > > > > > > > lot
> >> > > > > > > > > > > of
> >> > > > > > > > > > > >     hand-waving and conjecture and pointing to
> >> > > StackOverflow
> >> > > > > > > posts
> >> > > > > > > > > made
> >> > > > > > > > > > > by
> >> > > > > > > > > > > >     people who may be of questionable pedigree to
> >> begin
> >> > > with.
> >> > > > > > > This
> >> > > > > > > > > > > smells
> >> > > > > > > > > > > > of a
> >> > > > > > > > > > > >     Pedro-ego-fight rather than one of purely
> >> technical
> >> > > > > merit.
> >> > > > > > > > > Also, if
> >> > > > > > > > > > > > one
> >> > > > > > > > > > > >     knows how OMP works,  they would be very
> >> suspicious
> >> > > of
> >> > > > > the
> >> > > > > > > > > > > > "intermittent
> >> > > > > > > > > > > >     hangs" claim -- that's probably just broken race
> >> > > > > conditions
> >> > > > > > > > > elsewhere
> >> > > > > > > > > > > > until
> >> > > > > > > > > > > >     proven differently.  It'd tend freeze on the
> >> first
> >> > > use if
> >> > > > > > > > > something
> >> > > > > > > > > > > is
> >> > > > > > > > > > > >     wrong (try using libgomp after a fork and see),
> >> since
> >> > > > > worker
> >> > > > > > > > > threads"
> >> > > > > > > > > > > >     wouldn't be assigned/joined properly.  IntelOMP
> >> is
> >> > > > > faster,
> >> > > > > > > but
> >> > > > > > > > > also
> >> > > > > > > > > > > has
> >> > > > > > > > > > > >     other advantages, such as allowing OMP after a
> >> fork.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     I actually addressed a lot of issues and ask for
> >> > > > > > > clarification
> >> > > > > > > > > in the
> >> > > > > > > > > > > >     original PR's way back when, but they're all
> >> just
> >> > > > > ignored.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >     -Chris
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > >
> >>
> >


Re: OMP

Posted by kellen sunderland <ke...@gmail.com>.
I remember at the time we also had a read-through of this blog post, but to
us the code looked like it was following its advice:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
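
For reference, a minimal host-side C++ sketch of the pattern that post
recommends (assuming the standard CUDA runtime API and, for the example,
two visible devices): each thread that touches the GPU calls cudaSetDevice
before issuing any other CUDA calls.

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Sketch only: pin the CUDA device per thread, as the blog post advises.
    static void worker(int device_id) {
      cudaSetDevice(device_id);  // make this device current for this thread
      // ... any allocations / kernel launches for device_id would go here ...
    }

    int main() {
      std::vector<std::thread> pool;
      for (int d = 0; d < 2; ++d)        // assumes two visible GPUs
        pool.emplace_back(worker, d);
      for (auto &t : pool) t.join();
      return 0;
    }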

On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> believe the stacks for the hang are here:
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and
> the trick was we could only debug it up to the point that we hit:
>
> #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> futex_word=0x7fec60843758)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
>     at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_once_slow (once_control=0x7fec60843758,
> init_routine=0x7fec605f38f0)
>     at pthread_once.c:105
> ...
> #6  0x00007fec6061c577 in cudaSetDevice () from
> /usr/local/cuda/lib64/libcudart.so.9.0
>
> because the code in libcudart is obviously closed source we couldn't dig
> into what threading work was going on when we called cudaSetDevice.
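
For context on the frames above: a second caller of pthread_once blocks (in
futex_wait) until the first caller's init routine returns; if the thread
that was running the init routine no longer exists, for example after a
fork, that wait can last forever. A minimal sketch of just the pthread_once
behaviour, unrelated to MXNet or CUDA:

    #include <pthread.h>
    #include <unistd.h>
    #include <cstdio>

    static pthread_once_t once = PTHREAD_ONCE_INIT;

    static void slow_init() {
      sleep(2);                        // pretend initialization takes a while
      std::puts("init done");
    }

    static void *caller(void *) {
      pthread_once(&once, slow_init);  // a second caller blocks here until done
      return nullptr;
    }

    int main() {
      pthread_t a, b;
      pthread_create(&a, nullptr, caller, nullptr);
      pthread_create(&b, nullptr, caller, nullptr);
      pthread_join(a, nullptr);
      pthread_join(b, nullptr);
      return 0;
    }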
>
> On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <pe...@gmail.com>
> wrote:
>
>> If you check initialize.cc we seem to be explicitly disabling that
>> behaviour in pthread_at_fork which seems to cause thread contention
>> during multiprocessing. Why do we need this major advantage for the
>> library if that's the case?
>>
>> Related PRs:
>>
>> https://github.com/apache/incubator-mxnet/pull/10820
>> https://github.com/apache/incubator-mxnet/issues/14396
>>
>> The original code was authored in this PR:
>>
>> https://github.com/apache/incubator-mxnet/pull/8677
>>
>> I actually remember this fix; it was done during a release because the cuda
>> runtime was forking and the engine was being re-entered. If that
>> situation is not happening anymore, it might not be needed any longer.
>> I don't think we know why there was a fork inside cuda, so the code has
>> grown around a fix for an issue whose root cause was not understood,
>> along with the side effects that the fix caused afterwards.
>>
>> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
>> the link above, no libgomp.
>>
>> I didn't try the Make build.
>>
>> I would refactor the code linked above and stop using pthread_at_fork,
>> since OMP assumes it won't be initialized twice, but needs to be very
>> well tested to make sure it doesn't cause bugs or affect the fixes
>> done on the linked PRs above.
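
For readers following the thread, a rough sketch of the pthread_atfork
mechanism being discussed (the handler names here are hypothetical, not the
actual initialize.cc symbols); a child handler that re-creates threading
state after fork is what can end up re-entering OMP initialization:

    #include <pthread.h>

    static void before_fork()  { /* e.g. quiesce engine / OMP worker threads */ }
    static void parent_after() { /* e.g. resume them in the parent */ }
    static void child_after()  { /* e.g. re-create thread pools in the child */ }

    static int register_fork_handlers() {
      // Handlers run around every subsequent fork() in the process.
      return pthread_atfork(before_fork, parent_after, child_after);
    }
    static int registered = register_fork_handlers();  // at library load time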
>>
>> Pedro.
>>
>> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cj...@gmail.com>
>> wrote:
>> >
>> > one major advantage of intel/llvm omp is that it spawns a new thread
>> pool
>> > after fork if a thread pool was already created. this is so that omp
>> can be
>> > used in the forked processes. libgomp doesn’t do this so it’ll just
>> lock up
>> > if you try to do omp in the forked process.
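
A minimal sketch of the scenario described here (an illustration, not MXNet
code; whether the child hangs depends on which OMP runtime the binary is
linked against):

    #include <omp.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      #pragma omp parallel
      { }                                   // warm up the OMP thread pool

      pid_t pid = fork();
      if (pid == 0) {                       // child: pool worker threads are gone
        #pragma omp parallel
        { std::printf("child saw thread %d\n", omp_get_thread_num()); }
        _exit(0);
      }
      int status = 0;
      waitpid(pid, &status, 0);
      return 0;
    }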
>> >
>> > is your build linking libgomp as well?
>> >
>> > standard mkl build (from Makefile) uses same omp library. are there
>> > problems with that build?
>> >
>> > what changes need to be made to make the assertion not fire?
>> >
>> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
>> pedro.larroy.lists@gmail.com>
>> > wrote:
>> >
>> > > There's an assertion which is easily reproducible, and also there's a
>> > > crash including core dump, the latter is not easy to reproduce for me
>> > > in different environments. I have also seen mxnet getting stuck
>> > > without progressing with this build configuration and using no CPU at
>> > > all when running unit tests.
>> > >
>> > > In my view, the root cause of the assertion is that we are re-entering
>> > > OMP initialization when spawning threads on the following code through
>> > > pthread_at_fork
>> > >
>> > >
>> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
>> > >
>> > > This causes double initialization of the OMP engine, including the
>> > > assertion which you are asking about,  and I suspect some additional
>> > > overhead. That's the shady forking part you are asking for.
>> > >
>> > > A question for you: What is the cause of runtime differences between
>> > > OMP runtimes? Shouldn't the implementation overhead diminish as
>> > > threads run longer?
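
One rough way to put numbers on that question (a micro-benchmark sketch, not
MXNet code): time many short parallel regions. The per-region fork/join and
scheduling overhead is what differs between runtimes, and it shrinks in
relative terms as each region does more work.

    #include <omp.h>
    #include <cstdio>

    int main() {
      const int regions = 1000;             // many short parallel regions
      const int n = 100000;
      static double a[100000];

      double t0 = omp_get_wtime();
      for (int r = 0; r < regions; ++r) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) a[i] += 1.0;
      }
      double ms_per_region = (omp_get_wtime() - t0) * 1000.0 / regions;
      std::printf("avg cost per parallel region: %.3f ms\n", ms_per_region);
      return 0;
    }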
>> > >
>> > > Pedro.
>> > >
>> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cj...@gmail.com>
>> > > wrote:
>> > > >
>> > > > What’s the reason for the assertion failure? btw classifying an
>> assertion
>> > > > failure a “crash” is debatable. As I stated in the original issue a
>> long
>> > > > time ago, it’s possible something shady is being done with when
>> forking
>> > > > that should be fixed.  The assertion should be root caused.
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
>> > > pedro.larroy.lists@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Added a dockerfile, and reports of a crash in my local machine
>> when
>> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as
>> well.
>> > > > > I couldn't reproduce the crash on my EC2 machine:
>> > > > > Added the backtrace of the crash as well.
>> > > > >
>> > > > > https://github.com/apache/incubator-mxnet/issues/10856
>> > > > >
>> > > > > Dockerfile here:
>> > > > >
>> > > > > https://github.com/larroy/mxnet_omp
>> > > > >
>> > > > > Kind regards.
>> > > > >
>> > > > > Pedro.
>> > > > >
>> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
>> > > marco.g.abreu@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > As already proposed, I think the easiest way to get a common
>> > > > > understanding
>> > > > > > is if we start with a few docker containers. Pedro, would it be
>> > > possible
>> > > > > > for you to wrap your benchmarks into a few containers that will
>> > > produce
>> > > > > > your shown results? That way, we can avoid possible
>> > > misunderstandings and
>> > > > > > also pinpoint the exact parts where people disagree or
>> misunderstood
>> > > each
>> > > > > > other.
>> > > > > >
>> > > > > > -Marco
>> > > > > >
>> > > > > > Pedro Larroy <pe...@gmail.com> schrieb am Do.,
>> 20. Juni
>> > > > > 2019,
>> > > > > > 21:47:
>> > > > > >
>> > > > > > > I can confirm that we are linking with two versions of omp,
>> I'm
>> > > > > > > gaining more clarity into this topic, but I have still
>> questions,
>> > > the
>> > > > > > > facts that I got so far are the folllowing:
>> > > > > > >
>> > > > > > > * #1: We are linking with two versions of omp, intel's omp
>> and llvm
>> > > > > > > openmp when building with MKL enabled.
>> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP
>> (comes
>> > > with
>> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc)
>> (This
>> > > > > > > one is used on the PR proposed by Anton).
>> > > > > > >
>> > > > > > > Questions:
>> > > > > > >
>> > > > > > >  * #1 Is it ok to have two versions of openmp linked at the
>> same
>> > > time?
>> > > > > > >  * #2 Which implementation of OMP gives the best
>> performance?  (See
>> > > > > > > total training time of my measurement for a partial answer)
>> > > > > > >  * #3 Should we have a build flag so we can choose the OMP
>> version
>> > > at
>> > > > > > > runtime?
>> > > > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
>> > > slowdown?
>> > > > > > >  * #5 @Stas: is there a script to replicate your benchmarks
>> > > easily? If
>> > > > > > > so could you provide a link?  I think we would need to
>> reproduce
>> > > your
>> > > > > > > benchmarks and verify which versions are being linked. It's
>> > > possible
>> > > > > > > that while compiling with MKL intel's omp was pulled in
>> instead of
>> > > > > > > GNU OpenMP.
>> > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp?
>> Should we
>> > > > > > > update the subrepo regularly?
>> > > > > > >
>> > > > > > > My conclusion so far:
>> > > > > > >
>> > > > > > >  * #1 We should avoid linking two versions of omp if possible
>> and
>> > > > > > > allow users to choose one in the build as we do for BLAS.
>> > > > > > >  * #2 For performance reasons and more control vs different
>> > > compiler
>> > > > > > > versions seems it makes indeed sense to keep the LLVM OpenMP
>> > > version
>> > > > > > > in 3rdparty for now. So unless some more data is gathered, it
>> makes
>> > > > > > > sense not to remove it as of now.
>> > > > > > >  * #3 We should provide build options to choose which openmp
>> > > library
>> > > > > > > is to be used from the three options available, including
>> libgomp.
>> > > > > > >  * #4 Refining the build we could also enable OpenMP in mac
>> without
>> > > > > > > additional contortions (doesn't work as of today):
>> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
>> > > > > > >  * #5 We should add different omp versions to our benchmarks
>> and
>> > > track
>> > > > > > > the performance, so this data is available for prescribing
>> the best
>> > > > > > > build options and for binary releases.
>> > > > > > >
>> > > > > > > This is also an interesting related gh issue posted in the
>> mkl-dnn
>> > > > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
>> > > > > > >
>> > > > > > >
>> > > > > > > I don't observe the order of magnitude divergence reported by
>> > > Chris in
>> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training
>> finishes
>> > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs
>> libgomp.
>> > > > > > >
>> > > > > > > There's also differences in training time when using MKL and
>> the ,
>> > > > > > > it's actually a bit slower, I don't know if it's related to
>> OMP.
>> > > > > > >
>> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
>> > > > > > >
>> > > > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git
>>  branch
>> > > > > 'omp'
>> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
>> > > > > > > build/libmxnet.so |grep -i omp
>> > > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
>> > > > > > > (0x00007fd99a51d000)
>> > > > > > >
>> > > > > > > time python train_mnist.py
>> > > > > > >
>> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
>> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00
>> samples/sec
>> > > > > > >  accuracy=1.000000
>> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69
>> samples/sec
>> > > > > > >  accuracy=0.999531
>> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53
>> samples/sec
>> > > > > > >  accuracy=1.000000
>> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63
>> samples/sec
>> > > > > > >  accuracy=0.999375
>> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98
>> samples/sec
>> > > > > > >  accuracy=0.999531
>> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
>> > > > > > > INFO:root:Epoch[19] Time cost=1.219
>> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
>> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
>> (0avgtext+0avgdata
>> > > > > > > 1146052maxresident)k
>> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
>> > > > > > >
>> > > > > > > Master, MKL ON:
>> > > > > > >
>> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification
>> [master]>
>> > > ldd
>> > > > > > > ../../build/libmxnet.so | grep -i omp
>> > > > > > >         libomp.so =>
>> > > > > > >
>> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
>> > > > > > > (0x00007f05ba38f000)
>> > > > > > >         libiomp5.so =>
>> > > > > > >
>> > > > > > >
>> > > > >
>> > >
>> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
>> > > > > > > (0x00007f05b09f4000)
>> > > > > > >
>> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
>> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63
>> samples/sec
>> > > > > > >  accuracy=0.999691
>> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96
>> samples/sec
>> > > > > > >  accuracy=0.999531
>> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47
>> samples/sec
>> > > > > > >  accuracy=0.999375
>> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
>> > > > > > > INFO:root:Epoch[19] Time cost=1.367
>> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
>> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
>> (0avgtext+0avgdata
>> > > > > > > 1154348maxresident)k
>> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
>> > > > > > >
>> > > > > > >
>> > > > > > > MKL OFF:
>> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i
>> MKL
>> > > > > > > cmake_options.yml
>> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
>> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL
>> found) IF
>> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
>> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found)
>> IF
>> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
>> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
>> > > > > > > build/libmxnet.so |grep -i omp
>> > > > > > >         libomp.so =>
>> > > > > > >
>> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
>> > > > > > > (0x00007fb720c54000)
>> > > > > > >
>> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
>> > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02
>> samples/sec
>> > > > > > >  accuracy=1.000000
>> > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46
>> samples/sec
>> > > > > > >  accuracy=0.999375
>> > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55
>> samples/sec
>> > > > > > >  accuracy=0.999687
>> > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56
>> samples/sec
>> > > > > > >  accuracy=0.999844
>> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
>> > > > > > > INFO:root:Epoch[19] Time cost=1.259
>> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
>> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
>> (0avgtext+0avgdata
>> > > > > > > 1147008maxresident)k
>> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
>> > > > > > >
>> > > > > > > Let me know what you think.
>> > > > > > >
>> > > > > > > Link to the original PR:
>> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
>> > > > > > >
>> > > > > > > Thanks.
>> > > > > > >
>> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
>> > > > > > > <ke...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > "if you’re linking in two then you’re doing something
>> wrong."
>> > > > > Correct,
>> > > > > > > > that's one thing I believe we've got consensus on.  So
>> let's call
>> > > > > that
>> > > > > > > out
>> > > > > > > > as a bug to be fixed.
>> > > > > > > >
>> > > > > > > > Let's move forward with some reproducible numbers and then
>> > > discuss
>> > > > > the
>> > > > > > > pros
>> > > > > > > > / cons of which particular OMP implementation we should use.
>> > > > > > > >
>> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
>> > > > > > > pedro.larroy.lists@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Chris
>> > > > > > > > >
>> > > > > > > > > I would ask you to have a bit of patience and help us
>> with your
>> > > > > > > > > experience in this matter. Nobody is ignoring anything, I
>> > > think we
>> > > > > are
>> > > > > > > > > individually gathering feedbacks and trying to understand
>> the
>> > > > > multiple
>> > > > > > > > > contributions done to this topic including yours, then go
>> step
>> > > by
>> > > > > > > > > step, understand what is going on and run experiments and
>> > > report
>> > > > > back
>> > > > > > > > > to the list or the corresponding github item. It was
>> suggested
>> > > by
>> > > > > > > > > Kellen to prepare some containers, this takes effort.
>> > > > > > > > >
>> > > > > > > > > Regarding your final comment, most of us also have many
>> other
>> > > > > things
>> > > > > > > > > to do and responsibilities even if our daytime jobs might
>> > > involve
>> > > > > > > > > MXNet in some form or another. I think that's part of the
>> > > privilege
>> > > > > > > > > and responsibility of working close with an open source
>> > > project and
>> > > > > > > > > the magic of collaboration across organizations. Let's
>> all be
>> > > > > patient
>> > > > > > > > > and take some time to understand and reason about this
>> topic
>> > > which
>> > > > > is
>> > > > > > > > > not simple. Since we decided to step back and gather more
>> data
>> > > > > let's
>> > > > > > > > > take time and do it properly.
>> > > > > > > > >
>> > > > > > > > > Personally I hope to find time to look again into this
>> issue
>> > > before
>> > > > > > > > > the end of the week.
>> > > > > > > > >
>> > > > > > > > > Thanks.
>> > > > > > > > >
>> > > > > > > > > Pedro.
>> > > > > > > > >
>> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
>> > > > > cjolivier01@apache.org>
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > if you’re linking in two then you’re doing something
>> wrong.
>> > > You
>> > > > > can
>> > > > > > > see
>> > > > > > > > > by
>> > > > > > > > > > my email yesterday that only one is linked in. This is
>> also
>> > > the
>> > > > > case
>> > > > > > > with
>> > > > > > > > > > the mkl version built by the Makefile — only the Intel
>> OMP
>> > > > > library is
>> > > > > > > > > used
>> > > > > > > > > > (no libgomp).
>> > > > > > > > > >
>> > > > > > > > > > That being said, Do you have clear evidence that using
>> Intel
>> > > OMP
>> > > > > is
>> > > > > > > both
>> > > > > > > > > > problematic and the situation isn’t fixable?  The
>> burden of
>> > > > > proof is
>> > > > > > > on
>> > > > > > > > > the
>> > > > > > > > > > ones requesting the change — it is not my
>> responsibility to
>> > > > > justify
>> > > > > > > the
>> > > > > > > > > > current state.  There must be something “terrible” and
>> > > unfixable
>> > > > > to
>> > > > > > > > > justify
>> > > > > > > > > > a change.  I have seen no proof of this in all this
>> time.
>> > > > > > > > > >
>> > > > > > > > > > On a side note, I mentioned a couple of things in my
>> email
>> > > > > yesterday
>> > > > > > > that
>> > > > > > > > > > still are not being responded to (they were also
>> ignored in
>> > > the
>> > > > > last
>> > > > > > > > > > incarnation of this “discussion” — I have much
>> experience in
>> > > this
>> > > > > > > matter
>> > > > > > > > > to
>> > > > > > > > > > assume “discussion” is a waste of my time, seeing and I
>> am
>> > > not
>> > > > > paid
>> > > > > > > to
>> > > > > > > > > > “work on” mxnet like y’all are).
>> > > > > > > > > >
>> > > > > > > > > > -C
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
>> > > > > > > > > > kellen.sunderland@gmail.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I've also quite often seen two versions of OpenMP
>> linked.
>> > > I
>> > > > > think
>> > > > > > > we
>> > > > > > > > > can
>> > > > > > > > > > > all agree we probably want to avoid linking in two
>> > > libraries
>> > > > > that
>> > > > > > > do
>> > > > > > > > > > > effectively the same thing.
>> > > > > > > > > > >
>> > > > > > > > > > > The performance questions should be fairly straight
>> > > forward to
>> > > > > > > > > demonstrate
>> > > > > > > > > > > right?  Could we just collaborate on a few minimal
>> > > Dockerfiles
>> > > > > that
>> > > > > > > > > show
>> > > > > > > > > > > (or don't show) Intel OpenMP performance speedups
>> with the
>> > > > > > > workloads
>> > > > > > > > > Chris
>> > > > > > > > > > > is referencing?
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
>> > > > > > > > > > > stanislav.tsukrov@gmail.com> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi, Chris!
>> > > > > > > > > > > >
>> > > > > > > > > > > > Stas here - I've gathered that performance data.
>> > > > > > > > > > > > Sure thing, I can be wrong, but please elaborate a
>> bit on
>> > > > > what
>> > > > > > > we are
>> > > > > > > > > > > > missing.
>> > > > > > > > > > > > Be assured, intentional misdirection was never a
>> case.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks a lot for being constructive.
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well,
>> since it
>> > > > > tends
>> > > > > > > to
>> > > > > > > > > pull
>> > > > > > > > > > > in
>> > > > > > > > > > > > omp, depending which one is linked in).
>> > > > > > > > > > > >
>> > > > > > > > > > > > We never ever considered turning MKL off. We are on
>> the
>> > > same
>> > > > > page
>> > > > > > > > > here -
>> > > > > > > > > > > > MKL is crucial for the performance.
>> > > > > > > > > > > > Why should we? There's a GOMP-linked version of MKL,
>> > > that we
>> > > > > can
>> > > > > > > use.
>> > > > > > > > > > > >
>> > > > > > > > > > > > What we did - we measured, if using compilers
>> default
>> > > OpenMP
>> > > > > > > > > > > > implementation instead of referenced source code
>> > > > > distribution of
>> > > > > > > > > OpenMP
>> > > > > > > > > > > > makes anything slower.
>> > > > > > > > > > > > We have found the impact to be hardly measurable.
>> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on our
>> > > > > benchmarks,
>> > > > > > > most
>> > > > > > > > > of
>> > > > > > > > > > > the
>> > > > > > > > > > > > time less than that.
>> > > > > > > > > > > >
>> > > > > > > > > > > > We just suggest to simplify the build of mxnet, by
>> > > removing
>> > > > > the
>> > > > > > > > > > > > unnecessary dependency.
>> > > > > > > > > > > >
>> > > > > > > > > > > > During that we discovered for example the following
>> > > amazing
>> > > > > > > issue:
>> > > > > > > > > > > >
>> https://github.com/apache/incubator-mxnet/issues/14087
>> > > > > > > > > > > >
>> > > > > > > > > > > > Best Regards
>> > > > > > > > > > > >
>> > > > > > > > > > > > Stas
>> > > > > > > > > > > >
>> > > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" <
>> > > cjolivier01@gmail.com>
>> > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > >     I am very reluctant to feed the trolls again,
>> and
>> > > this
>> > > > > will
>> > > > > > > be
>> > > > > > > > > teh
>> > > > > > > > > > > last
>> > > > > > > > > > > >     time I address Pedro or Anton on the subject,
>> but
>> > > since I
>> > > > > > > think
>> > > > > > > > > the
>> > > > > > > > > > > > numbers
>> > > > > > > > > > > >     being presented are incorrect (either by te
>> builders
>> > > not
>> > > > > > > really
>> > > > > > > > > > > >     understanding what they are building, or
>> possibly
>> > > > > intentional
>> > > > > > > > > > > > misdirection):
>> > > > > > > > > > > >
>> > > > > > > > > > > >     Turning Intel OMP on and off (and MKL as well,
>> since
>> > > it
>> > > > > > > tends to
>> > > > > > > > > pull
>> > > > > > > > > > > > in
>> > > > > > > > > > > >     omp, depending which one is linked in).
>> > > > > > > > > > > >     There is a HUGE difference.  This is consistent
>> with
>> > > my
>> > > > > > > > > experience
>> > > > > > > > > > > > before
>> > > > > > > > > > > >     when it was added.
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >     default mnist:
>> > > > > > > > > > > >
>> > > > > > > > > > > >     python
>> ../example/image-classification/train_mnist.py
>> > > > > > > > > > > >     INFO:root:start with arguments
>> > > Namespace(add_stn=False,
>> > > > > > > > > > > batch_size=64,
>> > > > > > > > > > > >     disp_batches=100, dtype='float32',
>> gc_threshold=0.5,
>> > > > > > > > > gc_type='none',
>> > > > > > > > > > > >     gpus=None, image_shape='1, 28, 28',
>> > > > > initializer='default',
>> > > > > > > > > > > >     kv_store='device', load_epoch=None, loss='',
>> lr=0.05,
>> > > > > > > > > lr_factor=0.1,
>> > > > > > > > > > > >     lr_step_epochs='10', macrobatch_size=0,
>> > > > > model_prefix=None,
>> > > > > > > > > mom=0.9,
>> > > > > > > > > > > >     monitor=0, network='mlp', num_classes=10,
>> > > num_epochs=20,
>> > > > > > > > > > > >     num_examples=60000, num_layers=None,
>> optimizer='sgd',
>> > > > > > > > > > > >     profile_server_suffix='',
>> profile_worker_suffix='',
>> > > > > > > > > save_period=1,
>> > > > > > > > > > > >     test_io=0, top_k=0, warmup_epochs=5,
>> > > > > > > warmup_strategy='linear',
>> > > > > > > > > > > > wd=0.0001)
>> > > > > > > > > > > >
>> > > > > > > > > > > >     INTEL OMP:
>> > > > > > > > > > > >
>> > > > > > > > > > > >     ldd libmxnet.so | grep omp
>> > > > > > > > > > > >             libomp.so =>
>> > > > > > > > > > > >
>> > > > > > > > >
>> > > > >
>> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
>> > > > > > > > > > > >     (0x00007f978fde7000)
>> > > > > > > > > > > >
>> > > > > > > > > > > >     :root:Epoch[0] Batch [0-100]        Speed:
>> 31548.09
>> > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.780012
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
>> > > 16073.21
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.920469
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
>> > > 19075.91
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.928281
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
>> > > 23211.36
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.942813
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
>> > > 22139.79
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.938750
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
>> > > 23225.52
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.946562
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
>> > > 19547.41
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.953281
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
>> > > 24111.73
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.951562
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
>> > > 13959.88
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.957500
>> > > > > > > > > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
>> > > > > > > > > > > >     INFO:root:Epoch[0] Time cost=3.806
>> > > > > > > > > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
>> > > > > > > > > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed:
>> > > 24560.21
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.968131
>> > > > > > > > > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed:
>> > > 23457.03
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.966250
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >     LIBGOMP:
>> > > > > > > > > > > >
>> > > > > > > > > > > >     ldd libmxnet.so | grep omp
>> > > > > > > > > > > >             libgomp.so.1 =>
>> > > > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
>> > > > > > > > > > > >     (0x00007f25c25dd000)
>> > > > > > > > > > > >
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed:
>> > > 1731.01
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.782488
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
>> > > 3551.32
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.907813
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
>> > > 1991.00
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.927188
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
>> > > 2175.45
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.937969
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
>> > > 1644.95
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.942187
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
>> > > 6444.58
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.950156
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
>> > > 7842.16
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.947969
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
>> > > 9412.07
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >      accuracy=0.953750
>> > > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
>> > > 12707.58
>> > > > > > > > > samples/sec
>> > > > > > > > > > > >     accuracy=0.953125
>> > > > > > > > > > > >
>> > > > > > > > > > > >     That being said, there's other issued beyond
>> speed.
>> > > The
>> > > > > > > DEFAULT
>> > > > > > > > > > > build
>> > > > > > > > > > > > from
>> > > > > > > > > > > >     makefile (not CMake) uses Intel OMP mkl (I
>> showed
>> > > > > before) and
>> > > > > > > > > > > > mysteriously
>> > > > > > > > > > > >     it has no issues?  This seems highly suspicious.
>> > > All I
>> > > > > see
>> > > > > > > is a
>> > > > > > > > > lot
>> > > > > > > > > > > of
>> > > > > > > > > > > >     hand-waving and conjecture and pointing to
>> > > StackOverflow
>> > > > > > > posts
>> > > > > > > > > made
>> > > > > > > > > > > by
>> > > > > > > > > > > >     people who may be of questionable pedigree to
>> begin
>> > > with.
>> > > > > > > This
>> > > > > > > > > > > smells
>> > > > > > > > > > > > of a
>> > > > > > > > > > > >     Pedro-ego-fight rather than one of purely
>> technical
>> > > > > merit.
>> > > > > > > > > Also, if
>> > > > > > > > > > > > one
>> > > > > > > > > > > >     knows how OMP works,  they would be very
>> suspicious
>> > > of
>> > > > > the
>> > > > > > > > > > > > "intermittent
>> > > > > > > > > > > >     hangs" claim -- that's probably just broken race
>> > > > > conditions
>> > > > > > > > > elsewhere
>> > > > > > > > > > > > until
>> > > > > > > > > > > >     proven differently.  It'd tend freeze on the
>> first
>> > > use if
>> > > > > > > > > something
>> > > > > > > > > > > is
>> > > > > > > > > > > >     wrong (try using libgomp after a fork and see),
>> since
>> > > > > worker
>> > > > > > > > > threads"
>> > > > > > > > > > > >     wouldn't be assigned/joined properly.  IntelOMP
>> is
>> > > > > faster,
>> > > > > > > but
>> > > > > > > > > also
>> > > > > > > > > > > has
>> > > > > > > > > > > >     other advantages, such as allowing OMP after a
>> fork.
>> > > > > > > > > > > >
>> > > > > > > > > > > >     I actually addressed a lot of issues and ask for
>> > > > > > > clarification
>> > > > > > > > > in the
>> > > > > > > > > > > >     original PR's way back when, but they're all
>> just
>> > > > > ignored.
>> > > > > > > > > > > >
>> > > > > > > > > > > >     -Chris
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > >
>> > >
>>
>

Re: OMP

Posted by Chris Olivier <cj...@gmail.com>.
That doesn't look like it has anything to do with omp.

On Mon, Jun 24, 2019 at 6:40 PM kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> believe the stacks for the hang are here:
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> and
> the trick was we could only debug it up to the point that we hit:
>
> #0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1,
> futex_word=0x7fec60843758)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
>     at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_once_slow (once_control=0x7fec60843758,
> init_routine=0x7fec605f38f0)
>     at pthread_once.c:105
> ...
> #6  0x00007fec6061c577 in cudaSetDevice () from
> /usr/local/cuda/lib64/libcudart.so.9.0
>
> because the code in libcudart is obviously closed source we couldn't dig
> into what threading work was going on when we called cudaSetDevice.
>
> On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > If you check initialize.cc we seem to be explicitly disabling that
> > behaviour in pthread_at_fork which seems to cause thread contention
> > during multiprocessing. Why do we need this major advantage for the
> > library if that's the case?
> >
> > Related PRs:
> >
> > https://github.com/apache/incubator-mxnet/pull/10820
> > https://github.com/apache/incubator-mxnet/issues/14396
> >
> > The original code was authored in this PR:
> >
> > https://github.com/apache/incubator-mxnet/pull/8677
> >
> > I actually remember this fix; it was done during a release because the cuda
> > runtime was forking and the engine was being re-entered. If that
> > situation is not happening anymore, it might not be needed any longer.
> > I don't think we know why there was a fork inside cuda, so the code has
> > grown around a fix for an issue whose root cause was not understood,
> > along with the side effects that the fix caused afterwards.
> >
> > My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> > the link above, no libgomp.
> >
> > I didn't try the Make build.
> >
> > I would refactor the code linked above and stop using pthread_at_fork,
> > since OMP assumes it won't be initialized twice, but needs to be very
> > well tested to make sure it doesn't cause bugs or affect the fixes
> > done on the linked PRs above.
> >
> > Pedro.
> >
> > On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cj...@gmail.com>
> > wrote:
> > >
> > > one major advantage of intel/llvm omp is that it spawns a new thread
> pool
> > > after fork if a thread pool was already created. this is so that omp
> can
> > be
> > > used in the forked processes. libgomp doesn’t do this so it’ll just
> lock
> > up
> > > if you try to do omp in the forked process.
> > >
> > > is your build linking libgomp as well?
> > >
> > > standard mkl build (from Makefile) uses same omp library. are there
> > > problems with that build?
> > >
> > > what changes need to be made to make the assertion not fire?
> > >
> > > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > There's an assertion which is easily reproducible, and also there's a
> > > > crash including core dump, the latter is not easy to reproduce for me
> > > > in different environments. I have also seen mxnet getting stuck
> > > > without progressing with this build configuration and using no CPU at
> > > > all when running unit tests.
> > > >
> > > > In my view, the root cause of the assertion is that we are
> re-entering
> > > > OMP initialization when spawning threads on the following code
> through
> > > > pthread_at_fork
> > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > > >
> > > > This causes double initialization of the OMP engine, including the
> > > > assertion which you are asking about,  and I suspect some additional
> > > > overhead. That's the shady forking part you are asking for.
> > > >
> > > > A question for you: What is the cause of runtime differences between
> > > > OMP runtimes? Shouldn't the implementation overhead diminish as
> > > > threads run longer?
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cjolivier01@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > What’s the reason for the assertion failure? btw classifying an
> > assertion
> > > > > failure a “crash” is debatable. As I stated in the original issue a
> > long
> > > > > time ago, it’s possible something shady is being done with when
> > forking
> > > > > that should be fixed.  The assertion should be root caused.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Added a dockerfile, and reports of a crash in my local machine
> when
> > > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as
> > well.
> > > > > > I couldn't reproduce the crash on my EC2 machine:
> > > > > > Added the backtrace of the crash as well.
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > > > >
> > > > > > Dockerfile here:
> > > > > >
> > > > > > https://github.com/larroy/mxnet_omp
> > > > > >
> > > > > > Kind regards.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > > > marco.g.abreu@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > As already proposed, I think the easiest way to get a common
> > > > > > understanding
> > > > > > > is if we start with a few docker containers. Pedro, would it be
> > > > possible
> > > > > > > for you to wrap your benchmarks into a few containers that will
> > > > produce
> > > > > > > your shown results? That way, we can avoid possible
> > > > misunderstandings and
> > > > > > > also pinpoint the exact parts where people disagree or
> > misunderstood
> > > > each
> > > > > > > other.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Pedro Larroy <pe...@gmail.com> schrieb am Do.,
> 20.
> > Juni
> > > > > > 2019,
> > > > > > > 21:47:
> > > > > > >
> > > > > > > > I can confirm that we are linking with two versions of omp,
> I'm
> > > > > > > > gaining more clarity into this topic, but I have still
> > questions,
> > > > the
> > > > > > > > facts that I got so far are the folllowing:
> > > > > > > >
> > > > > > > > * #1: We are linking with two versions of omp, intel's omp
> and
> > llvm
> > > > > > > > openmp when building with MKL enabled.
> > > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP
> > (comes
> > > > with
> > > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc)
> > (This
> > > > > > > > one is used on the PR proposed by Anton).
> > > > > > > >
> > > > > > > > Questions:
> > > > > > > >
> > > > > > > >  * #1 Is it ok to have two versions of openmp linked at the
> > same
> > > > time?
> > > > > > > >  * #2 Which implementation of OMP gives the best performance?
> > (See
> > > > > > > > total training time of my measurement for a partial answer)
> > > > > > > >  * #3 Should we have a build flag so we can choose the OMP
> > version
> > > > at
> > > > > > > > runtime?
> > > > > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
> > > > slowdown?
> > > > > > > >  * #5 @Stas: is there a script to replicate your benchmarks
> > > > easily? If
> > > > > > > > so could you provide a link?  I think we would need to
> > reproduce
> > > > your
> > > > > > > > benchmarks and verify which versions are being linked. It's
> > > > possible
> > > > > > > > that while compiling with MKL intel's omp was pulled in
> > instead of
> > > > > > > > GNU OpenMP.
> > > > > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp?
> > Should we
> > > > > > > > update the subrepo regularly?
> > > > > > > >
> > > > > > > > My conclusion so far:
> > > > > > > >
> > > > > > > >  * #1 We should avoid linking two versions of omp if possible
> > and
> > > > > > > > allow users to choose one in the build as we do for BLAS.
> > > > > > > >  * #2 For performance reasons and more control vs different
> > > > compiler
> > > > > > > > versions seems it makes indeed sense to keep the LLVM OpenMP
> > > > version
> > > > > > > > in 3rdparty for now. So unless some more data is gathered, it
> > makes
> > > > > > > > sense not to remove it as of now.
> > > > > > > >  * #3 We should provide build options to choose which openmp
> > > > library
> > > > > > > > is to be used from the three options available, including
> > libgomp.
> > > > > > > >  * #4 Refining the build we could also enable OpenMP in mac
> > without
> > > > > > > > additional contortions (doesn't work as of today):
> > > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > > > > > >  * #5 We should add different omp versions to our benchmarks
> > and
> > > > track
> > > > > > > > the performance, so this data is available for prescribing
> the
> > best
> > > > > > > > build options and for binary releases.
> > > > > > > >
> > > > > > > > This is also an interesting related gh issue posted in the
> > mkl-dnn
> > > > > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > > > > > > >
> > > > > > > >
> > > > > > > > I don't observe the order of magnitude divergence reported by
> > > > Chris in
> > > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training
> > finishes
> > > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs
> > libgomp.
> > > > > > > >
> > > > > > > > There's also differences in training time when using MKL and
> > the ,
> > > > > > > > it's actually a bit slower, I don't know if it's related to
> > OMP.
> > > > > > > >
> > > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > > > > > >
> > > > > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git
> >  branch
> > > > > > 'omp'
> > > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > > > > > > build/libmxnet.so |grep -i omp
> > > > > > > >         libgomp.so.1 =>
> /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > > > > (0x00007fd99a51d000)
> > > > > > > >
> > > > > > > > time python train_mnist.py
> > > > > > > >
> > > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00
> > samples/sec
> > > > > > > >  accuracy=1.000000
> > > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69
> > samples/sec
> > > > > > > >  accuracy=0.999531
> > > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53
> > samples/sec
> > > > > > > >  accuracy=1.000000
> > > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63
> > samples/sec
> > > > > > > >  accuracy=0.999375
> > > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98
> > samples/sec
> > > > > > > >  accuracy=0.999531
> > > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > > > > INFO:root:Epoch[19] Time cost=1.219
> > > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU
> > (0avgtext+0avgdata
> > > > > > > > 1146052maxresident)k
> > > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> > > > > > > >
> > > > > > > > Master, MKL ON:
> > > > > > > >
> > > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification
> > [master]>
> > > > ldd
> > > > > > > > ../../build/libmxnet.so | grep -i omp
> > > > > > > >         libomp.so =>
> > > > > > > >
> > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > > (0x00007f05ba38f000)
> > > > > > > >         libiomp5.so =>
> > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > > > > > (0x00007f05b09f4000)
> > > > > > > >
> > > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63
> > samples/sec
> > > > > > > >  accuracy=0.999691
> > > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96
> > samples/sec
> > > > > > > >  accuracy=0.999531
> > > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47
> > samples/sec
> > > > > > > >  accuracy=0.999375
> > > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > > > > INFO:root:Epoch[19] Time cost=1.367
> > > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> > (0avgtext+0avgdata
> > > > > > > > 1154348maxresident)k
> > > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > > > > > > >
> > > > > > > >
> > > > > > > > MKL OFF:
> > > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i
> > MKL
> > > > > > > > cmake_options.yml
> > > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL
> > found) IF
> > > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found)
> IF
> > > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > > > > > > > build/libmxnet.so |grep -i omp
> > > > > > > >         libomp.so =>
> > > > > > > >
> > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > > (0x00007fb720c54000)
> > > > > > > >
> > > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02
> > samples/sec
> > > > > > > >  accuracy=1.000000
> > > > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46
> > samples/sec
> > > > > > > >  accuracy=0.999375
> > > > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55
> > samples/sec
> > > > > > > >  accuracy=0.999687
> > > > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56
> > samples/sec
> > > > > > > >  accuracy=0.999844
> > > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > > > > > > INFO:root:Epoch[19] Time cost=1.259
> > > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU
> > (0avgtext+0avgdata
> > > > > > > > 1147008maxresident)k
> > > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > > > > > > >
> > > > > > > > Let me know what you think.
> > > > > > > >
> > > > > > > > Link to the original PR:
> > > > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > > > > > > > <ke...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > "if you’re linking in two then you’re doing something
> wrong."
> > > > > > Correct,
> > > > > > > > > that's one thing I believe we've got consensus on.  So
> let's
> > call
> > > > > > that
> > > > > > > > out
> > > > > > > > > as a bug to be fixed.
> > > > > > > > >
> > > > > > > > > Let's move forward with some reproducible numbers and then
> > > > discuss
> > > > > > the
> > > > > > > > pros
> > > > > > > > > / cons of which particular OMP implementation we should
> use.
> > > > > > > > >
> > > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > > > > > > pedro.larroy.lists@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Chris
> > > > > > > > > >
> > > > > > > > > > I would ask you to have a bit of patience and help us
> with
> > your
> > > > > > > > > > experience in this matter. Nobody is ignoring anything, I
> > > > think we
> > > > > > are
> > > > > > > > > > individually gathering feedbacks and trying to understand
> > the
> > > > > > multiple
> > > > > > > > > > contributions done to this topic including yours, then go
> > step
> > > > by
> > > > > > > > > > step, understand what is going on and run experiments and
> > > > report
> > > > > > back
> > > > > > > > > > to the list or the corresponding github item. It was
> > suggested
> > > > by
> > > > > > > > > > Kellen to prepare some containers, this takes effort.
> > > > > > > > > >
> > > > > > > > > > Regarding your final comment, most of us also have many
> > other
> > > > > > things
> > > > > > > > > > to do and responsibilities even if our daytime jobs might
> > > > involve
> > > > > > > > > > MXNet in some form or another. I think that's part of the
> > > > privilege
> > > > > > > > > > and responsibility of working close with an open source
> > > > project and
> > > > > > > > > > the magic of collaboration across organizations. Let's
> all
> > be
> > > > > > patient
> > > > > > > > > > and take some time to understand and reason about this
> > topic
> > > > which
> > > > > > is
> > > > > > > > > > not simple. Since we decided to step back and gather more
> > data
> > > > > > let's
> > > > > > > > > > take time and do it properly.
> > > > > > > > > >
> > > > > > > > > > Personally I hope to find time to look again into this
> > issue
> > > > before
> > > > > > > > > > the end of the week.
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > > > > > cjolivier01@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > if you’re linking in two then you’re doing something
> > wrong.
> > > > You
> > > > > > can

Re: OMP

Posted by kellen sunderland <ke...@gmail.com>.
I remember this hang as well; it was pretty hard to reproduce, IIRC. I
believe the stacks for the hang are here:
https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
and the catch was that we could only debug it up to the point where we hit:

#0  0x00007fec6df1ba4f in futex_wait (private=0, expected=1, futex_word=0x7fec60843758)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
    at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_once_slow (once_control=0x7fec60843758, init_routine=0x7fec605f38f0)
    at pthread_once.c:105
...
#6  0x00007fec6061c577 in cudaSetDevice () from /usr/local/cuda/lib64/libcudart.so.9.0

Because the code in libcudart is obviously closed source, we couldn't dig
into what threading work was going on when we called cudaSetDevice.
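
For anyone who wants to see the general failure mode outside of CUDA, here
is a tiny standalone sketch (nothing MXNet- or libcudart-specific, just the
classic fork-while-another-thread-holds-a-lock situation; the names and
timings are made up for illustration). The child inherits the locked futex
word but not the thread that owns it, so it blocks forever, which is the
same shape as the stack above:

// Minimal sketch (illustrative only): fork() while another thread holds a lock.
// Only the forking thread exists in the child, so the lock is never released
// there and the child blocks forever in a futex wait when it tries to take it.
#include <pthread.h>
#include <unistd.h>
#include <sys/wait.h>
#include <cstdio>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void* holder(void*) {
  pthread_mutex_lock(&lock);
  sleep(2);                      // hold the lock while main() forks
  pthread_mutex_unlock(&lock);   // releases only the parent's copy
  return nullptr;
}

int main() {
  pthread_t t;
  pthread_create(&t, nullptr, holder, nullptr);
  usleep(100 * 1000);            // let holder() grab the lock first
  pid_t pid = fork();
  if (pid == 0) {
    pthread_mutex_lock(&lock);   // child: the owner doesn't exist here -> hangs
    printf("child: acquired lock\n");   // never reached
    _exit(0);
  }
  pthread_join(t, nullptr);
  waitpid(pid, nullptr, 0);      // parent sits here waiting on the stuck child
  return 0;
}

If something similar was going on inside libcudart, we had no way to see
which lock was playing that role.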


Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
If you check initialize.cc, we seem to be explicitly disabling that
behaviour via pthread_atfork, which seems to cause thread contention
during multiprocessing. Why do we need this major advantage for the
library if that's the case?
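
To make it concrete, this is roughly the pattern I am referring to (an
illustrative sketch only; the handler names and the child-side action are
invented here, it is not the actual initialize.cc code):

// Sketch of registering fork handlers through pthread_atfork() (illustration,
// not MXNet source). Whatever the child handler does runs in every forked
// worker process, which is where the re-initialization concerns come from.
#include <pthread.h>
#include <omp.h>

namespace {

void on_fork_prepare() { /* e.g. quiesce engine worker threads in the parent */ }
void on_fork_parent()  { /* e.g. resume engine worker threads in the parent  */ }

void on_fork_child() {
  // Hypothetical child-side work: rebuild thread pools and clamp OMP so the
  // child does not immediately spin up a full thread team again.
  omp_set_num_threads(1);
}

struct ForkHandlerRegistrar {
  ForkHandlerRegistrar() {
    pthread_atfork(on_fork_prepare, on_fork_parent, on_fork_child);
  }
};

ForkHandlerRegistrar register_fork_handlers;  // runs once at library load time

}  // namespace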

Related PRs:

https://github.com/apache/incubator-mxnet/pull/10820
https://github.com/apache/incubator-mxnet/issues/14396

The original code was authored in this PR:

https://github.com/apache/incubator-mxnet/pull/8677

I actually remember this fix; it was done during a release because the cuda
runtime was forking and the engine was being re-entered. If that
situation is not happening anymore, it might not be needed any longer.
I don't think we ever understood why there was a fork inside cuda, so
the code has grown around a fix for an issue whose root cause was not
understood, and around the side effects which this fix caused afterwards.

My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
the link above, no libgomp.

I didn't try the Make build.

I would refactor the code linked above and stop using pthread_atfork,
since OMP assumes it won't be initialized twice, but this needs to be
very well tested to make sure it doesn't cause bugs or affect the fixes
done in the linked PRs above.
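
As a minimal sanity check for such a refactor, I would expect something like
the following to pass with whichever OMP runtime we end up shipping (a rough
sketch, not an actual MXNet test; the 5-second timeout is arbitrary):

// Sketch of an OMP-after-fork regression check (illustration only). A runtime
// that cannot handle OMP in a forked child may hang in the child's parallel
// region; the parent's timeout turns that into a visible failure.
#include <omp.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Warm up the OMP thread pool in the parent before forking.
  #pragma omp parallel
  {
    (void)omp_get_thread_num();
  }

  pid_t pid = fork();
  if (pid == 0) {
    int threads = 0;
    #pragma omp parallel
    {
      #pragma omp single
      threads = omp_get_num_threads();
    }
    printf("child ran a parallel region with %d threads\n", threads);
    _exit(0);
  }

  // Give the child a few seconds; if it is stuck in OMP after fork, fail.
  for (int i = 0; i < 50; ++i) {
    int status = 0;
    if (waitpid(pid, &status, WNOHANG) == pid) return 0;  // child finished
    usleep(100 * 1000);
  }
  fprintf(stderr, "FAIL: child appears stuck in an OMP region after fork\n");
  kill(pid, SIGKILL);
  return 1;
}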

Pedro.

On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cj...@gmail.com> wrote:
>
> one major advantage of intel/llvm omp is that it spawns a new thread pool
> after fork if a thread pool was already created. this is so that omp can be
> used in the forked processes. libgomp doesn’t do this so it’ll just lock up
> if you try to do omp in the forked process.
>
> is your build linking libgomp as well?
>
> standard mkl build (from Makefile) uses same omp library. are there
> problems with that build?
>
> what changes need to be made to make the assertion not fire?
>
> On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > There's an assertion which is easily reproducible, and also there's a
> > crash including core dump, the latter is not easy to reproduce for me
> > in different environments. I have also seen mxnet getting stuck
> > without progressing with this build configuration and using no CPU at
> > all when running unit tests.
> >
> > In my view, the root cause of the assertion is that we are re-entering
> > OMP initialization when spawning threads on the following code through
> > pthread_at_fork
> >
> > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> >
> > This causes double initialization of the OMP engine, including the
> > assertion which you are asking about,  and I suspect some additional
> > overhead. That's the shady forking part you are asking for.
> >
> > A question for you: What is the cause of runtime differences between
> > OMP runtimes? Shouldn't the implementation overhead diminish as
> > threads run longer?
> >
> > Pedro.
> >
> > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cj...@gmail.com>
> > wrote:
> > >
> > > What’s the reason for the assertion failure? btw classifying an assertion
> > > failure a “crash” is debatable. As I stated in the original issue a long
> > > time ago, it’s possible something shady is being done with when forking
> > > that should be fixed.  The assertion should be root caused.
> > >
> > >
> > >
> > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > Added a dockerfile, and reports of a crash in my local machine when
> > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
> > > > I couldn't reproduce the crash on my EC2 machine:
> > > > Added the backtrace of the crash as well.
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > >
> > > > Dockerfile here:
> > > >
> > > > https://github.com/larroy/mxnet_omp
> > > >
> > > > Kind regards.
> > > >
> > > > Pedro.
> > > >
> > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > marco.g.abreu@gmail.com>
> > > > wrote:
> > > > >
> > > > > As already proposed, I think the easiest way to get a common
> > > > understanding
> > > > > is if we start with a few docker containers. Pedro, would it be
> > possible
> > > > > for you to wrap your benchmarks into a few containers that will
> > produce
> > > > > your shown results? That way, we can avoid possible
> > misunderstandings and
> > > > > also pinpoint the exact parts where people disagree or misunderstood
> > each
> > > > > other.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Pedro Larroy <pe...@gmail.com> schrieb am Do., 20. Juni
> > > > 2019,
> > > > > 21:47:
> > > > >
> > > > > > I can confirm that we are linking with two versions of omp, I'm
> > > > > > gaining more clarity into this topic, but I have still questions,
> > the
> > > > > > facts that I got so far are the folllowing:
> > > > > >
> > > > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > > > openmp when building with MKL enabled.
> > > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes
> > with
> > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > > > one is used on the PR proposed by Anton).
> > > > > >
> > > > > > Questions:
> > > > > >
> > > > > >  * #1 Is it ok to have two versions of openmp linked at the same
> > time?
> > > > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > > > total training time of my measurement for a partial answer)
> > > > > >  * #3 Should we have a build flag so we can choose the OMP version
> > at
> > > > > > runtime?
> > > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
> > slowdown?
> > > > > >  * #5 @Stas: is there a script to replicate your benchmarks
> > easily? If
> > > > > > so could you provide a link?  I think we would need to reproduce
> > your
> > > > > > benchmarks and verify which versions are being linked. It's
> > possible
> > > > > > that while compiling with MKL intel's omp was pulled in instead of
> > > > > > GNU OpenMP.
> > > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > > > > update the subrepo regularly?
> > > > > >
> > > > > > My conclusion so far:
> > > > > >
> > > > > >  * #1 We should avoid linking two versions of omp if possible and
> > > > > > allow users to choose one in the build as we do for BLAS.
> > > > > >  * #2 For performance reasons and more control vs different
> > compiler
> > > > > > versions seems it makes indeed sense to keep the LLVM OpenMP
> > version
> > > > > > in 3rdparty for now. So unless some more data is gathered, it makes
> > > > > > sense not to remove it as of now.
> > > > > >  * #3 We should provide build options to choose which openmp
> > library
> > > > > > is to be used from the three options available, including libgomp.
> > > > > >  * #4 Refining the build we could also enable OpenMP in mac without
> > > > > > additional contortions (doesn't work as of today):
> > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > > > >  * #5 We should add different omp versions to our benchmarks and
> > track
> > > > > > the performance, so this data is available for prescribing the best
> > > > > > build options and for binary releases.
> > > > > >
> > > > > > This is also an interesting related gh issue posted in the mkl-dnn
> > > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > > > > >
> > > > > >
> > > > > > I don't observe the order of magnitude divergence reported by
> > Chris in
> > > > > > vanilla Ubuntu 18.04 in samples / s but the full training finishes
> > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp.
> > > > > >
> > > > > > There's also differences in training time when using MKL and the ,
> > > > > > it's actually a bit slower, I don't know if it's related to OMP.
> > > > > >
> > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > > > >
> > > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git   branch
> > > > 'omp'
> > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > > > > build/libmxnet.so |grep -i omp
> > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > > (0x00007fd99a51d000)
> > > > > >
> > > > > > time python train_mnist.py
> > > > > >
> > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00 samples/sec
> > > > > >  accuracy=1.000000
> > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69 samples/sec
> > > > > >  accuracy=0.999531
> > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53 samples/sec
> > > > > >  accuracy=1.000000
> > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63 samples/sec
> > > > > >  accuracy=0.999375
> > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98 samples/sec
> > > > > >  accuracy=0.999531
> > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > > INFO:root:Epoch[19] Time cost=1.219
> > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata
> > > > > > 1146052maxresident)k
> > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> > > > > >
> > > > > > Master, MKL ON:
> > > > > >
> > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]>
> > ldd
> > > > > > ../../build/libmxnet.so | grep -i omp
> > > > > >         libomp.so =>
> > > > > >
> > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > (0x00007f05ba38f000)
> > > > > >         libiomp5.so =>
> > > > > >
> > > > > >
> > > >
> > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > > > (0x00007f05b09f4000)
> > > > > >
> > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63 samples/sec
> > > > > >  accuracy=0.999691
> > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96 samples/sec
> > > > > >  accuracy=0.999531
> > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47 samples/sec
> > > > > >  accuracy=0.999375
> > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > > INFO:root:Epoch[19] Time cost=1.367
> > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
> > > > > > 1154348maxresident)k
> > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > > > > >
> > > > > >
> > > > > > MKL OFF:
> > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
> > > > > > cmake_options.yml
> > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > > > > > build/libmxnet.so |grep -i omp
> > > > > >         libomp.so =>
> > > > > >
> > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > (0x00007fb720c54000)
> > > > > >
> > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02 samples/sec
> > > > > >  accuracy=1.000000
> > > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46 samples/sec
> > > > > >  accuracy=0.999375
> > > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55 samples/sec
> > > > > >  accuracy=0.999687
> > > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56 samples/sec
> > > > > >  accuracy=0.999844
> > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > > > > INFO:root:Epoch[19] Time cost=1.259
> > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
> > > > > > 1147008maxresident)k
> > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > > > > >
> > > > > > Let me know what you think.
> > > > > >
> > > > > > Link to the original PR:
> > > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > > > > > <ke...@gmail.com> wrote:
> > > > > > >
> > > > > > > "if you’re linking in two then you’re doing something wrong."
> > > > Correct,
> > > > > > > that's one thing I believe we've got consensus on.  So let's call
> > > > that
> > > > > > out
> > > > > > > as a bug to be fixed.
> > > > > > >
> > > > > > > Let's move forward with some reproducible numbers and then
> > discuss
> > > > the
> > > > > > pros
> > > > > > > / cons of which particular OMP implementation we should use.
> > > > > > >
> > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > > > > pedro.larroy.lists@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Chris
> > > > > > > >
> > > > > > > > I would ask you to have a bit of patience and help us with your
> > > > > > > > experience in this matter. Nobody is ignoring anything, I
> > think we
> > > > are
> > > > > > > > individually gathering feedbacks and trying to understand the
> > > > multiple
> > > > > > > > contributions done to this topic including yours, then go step
> > by
> > > > > > > > step, understand what is going on and run experiments and
> > report
> > > > back
> > > > > > > > to the list or the corresponding github item. It was suggested
> > by
> > > > > > > > Kellen to prepare some containers, this takes effort.
> > > > > > > >
> > > > > > > > Regarding your final comment, most of us also have many other
> > > > things
> > > > > > > > to do and responsibilities even if our daytime jobs might
> > involve
> > > > > > > > MXNet in some form or another. I think that's part of the
> > privilege
> > > > > > > > and responsibility of working close with an open source
> > project and
> > > > > > > > the magic of collaboration across organizations. Let's all be
> > > > patient
> > > > > > > > and take some time to understand and reason about this topic
> > which
> > > > is
> > > > > > > > not simple. Since we decided to step back and gather more data
> > > > let's
> > > > > > > > take time and do it properly.
> > > > > > > >
> > > > > > > > Personally I hope to find time to look again into this issue
> > before
> > > > > > > > the end of the week.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Pedro.
> > > > > > > >
> > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > > > cjolivier01@apache.org>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > if you’re linking in two then you’re doing something wrong.
> > You
> > > > can
> > > > > > see
> > > > > > > > by
> > > > > > > > > my email yesterday that only one is linked in. This is also
> > the
> > > > case
> > > > > > with
> > > > > > > > > the mkl version built by the Makefile — only the Intel OMP
> > > > library is
> > > > > > > > used
> > > > > > > > > (no libgomp).
> > > > > > > > >
> > > > > > > > > That being said, Do you have clear evidence that using Intel
> > OMP
> > > > is
> > > > > > both
> > > > > > > > > problematic and the situation isn’t fixable?  The burden of
> > > > proof is
> > > > > > on
> > > > > > > > the
> > > > > > > > > ones requesting the change — it is not my responsibility to
> > > > justify
> > > > > > the
> > > > > > > > > current state.  There must be something “terrible” and
> > unfixable
> > > > to
> > > > > > > > justify
> > > > > > > > > a change.  I have seen no proof of this in all this time.
> > > > > > > > >
> > > > > > > > > On a side note, I mentioned a couple of things in my email
> > > > yesterday
> > > > > > that
> > > > > > > > > still are not being responded to (they were also ignored in
> > the
> > > > last
> > > > > > > > > incarnation of this “discussion” — I have much experience in
> > this
> > > > > > matter
> > > > > > > > to
> > > > > > > > > assume “discussion” is a waste of my time, seeing and I am
> > not
> > > > paid
> > > > > > to
> > > > > > > > > “work on” mxnet like y’all are).
> > > > > > > > >
> > > > > > > > > -C
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > > > > > > > > kellen.sunderland@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > I've also quite often seen two versions of OpenMP linked.
> > I
> > > > think
> > > > > > we
> > > > > > > > can
> > > > > > > > > > all agree we probably want to avoid linking in two
> > libraries
> > > > that
> > > > > > do
> > > > > > > > > > effectively the same thing.
> > > > > > > > > >
> > > > > > > > > > The performance questions should be fairly straight
> > forward to
> > > > > > > > demonstrate
> > > > > > > > > > right?  Could we just collaborate on a few minimal
> > Dockerfiles
> > > > that
> > > > > > > > show
> > > > > > > > > > (or don't show) Intel OpenMP performance speedups with the
> > > > > > workloads
> > > > > > > > Chris
> > > > > > > > > > is referencing?
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > > > > > > > > stanislav.tsukrov@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Chris!
> > > > > > > > > > >
> > > > > > > > > > > Stas here - I've gathered that performance data.
> > > > > > > > > > > Sure thing, I can be wrong, but please elaborate a bit on
> > > > what
> > > > > > we are
> > > > > > > > > > > missing.
> > > > > > > > > > > Be assured, intentional misdirection was never a case.
> > > > > > > > > > >
> > > > > > > > > > > Thanks a lot for being constructive.
> > > > > > > > > > >
> > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well, since it
> > > > tends
> > > > > > to
> > > > > > > > pull
> > > > > > > > > > in
> > > > > > > > > > > omp, depending which one is linked in).
> > > > > > > > > > >
> > > > > > > > > > > We never ever considered turning MKL off. We are on the
> > same
> > > > page
> > > > > > > > here -
> > > > > > > > > > > MKL is crucial for the performance.
> > > > > > > > > > > Why should we? There's a GOMP-linked version of MKL,
> > that we
> > > > can
> > > > > > use.
> > > > > > > > > > >
> > > > > > > > > > > What we did - we measured, if using compilers default
> > OpenMP
> > > > > > > > > > > implementation instead of referenced source code
> > > > distribution of
> > > > > > > > OpenMP
> > > > > > > > > > > makes anything slower.
> > > > > > > > > > > We have found the impact to be hardly measurable.
> > > > > > > > > > > The difference between GOMP and iOMP is <5% on our
> > > > benchmarks,
> > > > > > most
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > time less than that.
> > > > > > > > > > >
> > > > > > > > > > > We just suggest to simplify the build of mxnet, by
> > removing
> > > > the
> > > > > > > > > > > unnecessary dependency.
> > > > > > > > > > >
> > > > > > > > > > > During that we discovered for example the following
> > amazing
> > > > > > issue:
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > > > > > > > >
> > > > > > > > > > > Best Regards
> > > > > > > > > > >
> > > > > > > > > > > Stas
> > > > > > > > > > >
> > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" <
> > cjolivier01@gmail.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >     I am very reluctant to feed the trolls again, and
> > this
> > > > will
> > > > > > be
> > > > > > > > teh
> > > > > > > > > > last
> > > > > > > > > > >     time I address Pedro or Anton on the subject, but
> > since I
> > > > > > think
> > > > > > > > the
> > > > > > > > > > > numbers
> > > > > > > > > > >     being presented are incorrect (either by te builders
> > not
> > > > > > really
> > > > > > > > > > >     understanding what they are building, or possibly
> > > > intentional
> > > > > > > > > > > misdirection):
> > > > > > > > > > >
> > > > > > > > > > >     Turning Intel OMP on and off (and MKL as well, since
> > it
> > > > > > tends to
> > > > > > > > pull
> > > > > > > > > > > in
> > > > > > > > > > >     omp, depending which one is linked in).
> > > > > > > > > > >     There is a HUGE difference.  This is consistent with
> > my
> > > > > > > > experience
> > > > > > > > > > > before
> > > > > > > > > > >     when it was added.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >     default mnist:
> > > > > > > > > > >
> > > > > > > > > > >     python ../example/image-classification/train_mnist.py
> > > > > > > > > > >     INFO:root:start with arguments
> > Namespace(add_stn=False,
> > > > > > > > > > batch_size=64,
> > > > > > > > > > >     disp_batches=100, dtype='float32', gc_threshold=0.5,
> > > > > > > > gc_type='none',
> > > > > > > > > > >     gpus=None, image_shape='1, 28, 28',
> > > > initializer='default',
> > > > > > > > > > >     kv_store='device', load_epoch=None, loss='', lr=0.05,
> > > > > > > > lr_factor=0.1,
> > > > > > > > > > >     lr_step_epochs='10', macrobatch_size=0,
> > > > model_prefix=None,
> > > > > > > > mom=0.9,
> > > > > > > > > > >     monitor=0, network='mlp', num_classes=10,
> > num_epochs=20,
> > > > > > > > > > >     num_examples=60000, num_layers=None, optimizer='sgd',
> > > > > > > > > > >     profile_server_suffix='', profile_worker_suffix='',
> > > > > > > > save_period=1,
> > > > > > > > > > >     test_io=0, top_k=0, warmup_epochs=5,
> > > > > > warmup_strategy='linear',
> > > > > > > > > > > wd=0.0001)
> > > > > > > > > > >
> > > > > > > > > > >     INTEL OMP:
> > > > > > > > > > >
> > > > > > > > > > >     ldd libmxnet.so | grep omp
> > > > > > > > > > >             libomp.so =>
> > > > > > > > > > > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > > > > >     (0x00007f978fde7000)
> > > > > > > > > > >
> > > > > > > > > > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09
> > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.780012
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
> > 16073.21
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.920469
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
> > 19075.91
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.928281
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
> > 23211.36
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.942813
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
> > 22139.79
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.938750
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
> > 23225.52
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.946562
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
> > 19547.41
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.953281
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
> > 24111.73
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.951562
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
> > 13959.88
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.957500
> > > > > > > > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > > > > > > > >     INFO:root:Epoch[0] Time cost=3.806
> > > > > > > > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > > > > > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed:
> > 24560.21
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.968131
> > > > > > > > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed:
> > 23457.03
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.966250
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >     LIBGOMP:
> > > > > > > > > > >
> > > > > > > > > > >     ldd libmxnet.so | grep omp
> > > > > > > > > > >             libgomp.so.1 =>
> > > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > > > > > > >     (0x00007f25c25dd000)
> > > > > > > > > > >
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed:
> > 1731.01
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.782488
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
> > 3551.32
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.907813
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
> > 1991.00
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.927188
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
> > 2175.45
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.937969
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
> > 1644.95
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.942187
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
> > 6444.58
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.950156
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
> > 7842.16
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.947969
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
> > 9412.07
> > > > > > > > samples/sec
> > > > > > > > > > >      accuracy=0.953750
> > > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
> > 12707.58
> > > > > > > > samples/sec
> > > > > > > > > > >     accuracy=0.953125
> > > > > > > > > > >
> > > > > > > > > > >     That being said, there's other issued beyond speed.
> > The
> > > > > > DEFAULT
> > > > > > > > > > build
> > > > > > > > > > > from
> > > > > > > > > > >     makefile (not CMake) uses Intel OMP mkl (I showed
> > > > before) and
> > > > > > > > > > > mysteriously
> > > > > > > > > > >     it has no issues?  This seems highly suspicious.
> > All I
> > > > see
> > > > > > is a
> > > > > > > > lot
> > > > > > > > > > of
> > > > > > > > > > >     hand-waving and conjecture and pointing to
> > StackOverflow
> > > > > > posts
> > > > > > > > made
> > > > > > > > > > by
> > > > > > > > > > >     people who may be of questionable pedigree to begin
> > with.
> > > > > > This
> > > > > > > > > > smells
> > > > > > > > > > > of a
> > > > > > > > > > >     Pedro-ego-fight rather than one of purely technical
> > > > merit.
> > > > > > > > Also, if
> > > > > > > > > > > one
> > > > > > > > > > >     knows how OMP works,  they would be very suspicious
> > of
> > > > the
> > > > > > > > > > > "intermittent
> > > > > > > > > > >     hangs" claim -- that's probably just broken race
> > > > conditions
> > > > > > > > elsewhere
> > > > > > > > > > > until
> > > > > > > > > > >     proven differently.  It'd tend freeze on the first
> > use if
> > > > > > > > something
> > > > > > > > > > is
> > > > > > > > > > >     wrong (try using libgomp after a fork and see), since
> > > > worker
> > > > > > > > threads"
> > > > > > > > > > >     wouldn't be assigned/joined properly.  IntelOMP is
> > > > faster,
> > > > > > but
> > > > > > > > also
> > > > > > > > > > has
> > > > > > > > > > >     other advantages, such as allowing OMP after a fork.
> > > > > > > > > > >
> > > > > > > > > > >     I actually addressed a lot of issues and ask for
> > > > > > clarification
> > > > > > > > in the
> > > > > > > > > > >     original PR's way back when, but they're all just
> > > > ignored.
> > > > > > > > > > >
> > > > > > > > > > >     -Chris
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >

Re: OMP

Posted by Chris Olivier <cj...@gmail.com>.
One major advantage of Intel/LLVM OMP is that it spawns a new thread pool
after a fork if a thread pool was already created, so OMP can still be used
in the forked processes. libgomp doesn’t do this, so it will just lock up if
you try to use OMP in the forked process.
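
A minimal standalone sketch of the behaviour being described (a hypothetical
fork_omp.cc, not MXNet code; build it with -fopenmp against whichever OpenMP
runtime you want to test):

// fork_omp.cc -- hedged sketch, not MXNet code.
// Build: g++ -fopenmp fork_omp.cc -o fork_omp (or clang++, linking the
// OpenMP runtime under test). It creates the OpenMP thread pool in the
// parent, forks, and then opens a parallel region in the child. llvm/intel
// omp gives the child a fresh pool; libgomp is reported to lock up here.
#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Touch OpenMP once so the runtime spawns its worker threads in the parent.
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      std::printf("parent: %d threads\n", omp_get_num_threads());
  }

  pid_t pid = fork();
  if (pid == 0) {
    // Child process: this is the case the two runtimes handle differently.
    #pragma omp parallel
    {
      if (omp_get_thread_num() == 0)
        std::printf("child: %d threads\n", omp_get_num_threads());
    }
    _exit(0);
  }
  waitpid(pid, nullptr, 0);
  return 0;
}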

Is your build linking libgomp as well?
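(For example, running "ldd build/libmxnet.so | grep -i omp" on the build in
question, as done elsewhere in this thread, should list exactly one OpenMP
runtime.)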

The standard MKL build (from the Makefile) uses the same OMP library. Are
there problems with that build?

What changes need to be made so that the assertion does not fire?

On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <pe...@gmail.com>
wrote:

> There's an assertion which is easily reproducible, and also there's a
> crash including core dump, the latter is not easy to reproduce for me
> in different environments. I have also seen mxnet getting stuck
> without progressing with this build configuration and using no CPU at
> all when running unit tests.
>
> In my view, the root cause of the assertion is that we are re-entering
> OMP initialization when spawning threads on the following code through
> pthread_at_fork
>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
>
> This causes double initialization of the OMP engine, including the
> assertion which you are asking about,  and I suspect some additional
> overhead. That's the shady forking part you are asking for.
>
> A question for you: What is the cause of runtime differences between
> OMP runtimes? Shouldn't the implementation overhead diminish as
> threads run longer?
>
> Pedro.
>
> On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cj...@gmail.com>
> wrote:
> >
> > What’s the reason for the assertion failure? btw classifying an assertion
> > failure a “crash” is debatable. As I stated in the original issue a long
> > time ago, it’s possible something shady is being done with when forking
> > that should be fixed.  The assertion should be root caused.
> >
> >
> >
> > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > Added a dockerfile, and reports of a crash in my local machine when
> > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
> > > I couldn't reproduce the crash on my EC2 machine:
> > > Added the backtrace of the crash as well.
> > >
> > > https://github.com/apache/incubator-mxnet/issues/10856
> > >
> > > Dockerfile here:
> > >
> > > https://github.com/larroy/mxnet_omp
> > >
> > > Kind regards.
> > >
> > > Pedro.
> > >
> > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> marco.g.abreu@gmail.com>
> > > wrote:
> > > >
> > > > As already proposed, I think the easiest way to get a common
> > > understanding
> > > > is if we start with a few docker containers. Pedro, would it be
> possible
> > > > for you to wrap your benchmarks into a few containers that will
> produce
> > > > your shown results? That way, we can avoid possible
> misunderstandings and
> > > > also pinpoint the exact parts where people disagree or misunderstood
> each
> > > > other.
> > > >
> > > > -Marco
> > > >
> > > > Pedro Larroy <pe...@gmail.com> schrieb am Do., 20. Juni
> > > 2019,
> > > > 21:47:
> > > >
> > > > > I can confirm that we are linking with two versions of omp, I'm
> > > > > gaining more clarity into this topic, but I have still questions,
> the
> > > > > facts that I got so far are the folllowing:
> > > > >
> > > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > > openmp when building with MKL enabled.
> > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes
> with
> > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > > one is used on the PR proposed by Anton).
> > > > >
> > > > > Questions:
> > > > >
> > > > >  * #1 Is it ok to have two versions of openmp linked at the same
> time?
> > > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > > total training time of my measurement for a partial answer)
> > > > >  * #3 Should we have a build flag so we can choose the OMP version
> at
> > > > > runtime?
> > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
> slowdown?
> > > > >  * #5 @Stas: is there a script to replicate your benchmarks
> easily? If
> > > > > so could you provide a link?  I think we would need to reproduce
> your
> > > > > benchmarks and verify which versions are being linked. It's
> possible
> > > > > that while compiling with MKL intel's omp was pulled in instead of
> > > > > GNU OpenMP.
> > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > > > update the subrepo regularly?
> > > > >
> > > > > My conclusion so far:
> > > > >
> > > > >  * #1 We should avoid linking two versions of omp if possible and
> > > > > allow users to choose one in the build as we do for BLAS.
> > > > >  * #2 For performance reasons and more control vs different
> compiler
> > > > > versions seems it makes indeed sense to keep the LLVM OpenMP
> version
> > > > > in 3rdparty for now. So unless some more data is gathered, it makes
> > > > > sense not to remove it as of now.
> > > > >  * #3 We should provide build options to choose which openmp
> library
> > > > > is to be used from the three options available, including libgomp.
> > > > >  * #4 Refining the build we could also enable OpenMP in mac without
> > > > > additional contortions (doesn't work as of today):
> > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > > >  * #5 We should add different omp versions to our benchmarks and
> track
> > > > > the performance, so this data is available for prescribing the best
> > > > > build options and for binary releases.
> > > > >
> > > > > This is also an interesting related gh issue posted in the mkl-dnn
> > > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > > > >
> > > > >
> > > > > I don't observe the order of magnitude divergence reported by
> Chris in
> > > > > vanilla Ubuntu 18.04 in samples / s but the full training finishes
> > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp.
> > > > >
> > > > > There's also differences in training time when using MKL and the ,
> > > > > it's actually a bit slower, I don't know if it's related to OMP.
> > > > >
> > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > > >
> > > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git   branch
> > > 'omp'
> > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > > > build/libmxnet.so |grep -i omp
> > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > (0x00007fd99a51d000)
> > > > >
> > > > > time python train_mnist.py
> > > > >
> > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00 samples/sec
> > > > >  accuracy=1.000000
> > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69 samples/sec
> > > > >  accuracy=0.999531
> > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53 samples/sec
> > > > >  accuracy=1.000000
> > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63 samples/sec
> > > > >  accuracy=0.999375
> > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98 samples/sec
> > > > >  accuracy=0.999531
> > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > INFO:root:Epoch[19] Time cost=1.219
> > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata
> > > > > 1146052maxresident)k
> > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> > > > >
> > > > > Master, MKL ON:
> > > > >
> > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]>
> ldd
> > > > > ../../build/libmxnet.so | grep -i omp
> > > > >         libomp.so =>
> > > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > (0x00007f05ba38f000)
> > > > >         libiomp5.so =>
> > > > > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > > (0x00007f05b09f4000)
> > > > >
> > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63 samples/sec
> > > > >  accuracy=0.999691
> > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96 samples/sec
> > > > >  accuracy=0.999531
> > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47 samples/sec
> > > > >  accuracy=0.999375
> > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > INFO:root:Epoch[19] Time cost=1.367
> > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
> > > > > 1154348maxresident)k
> > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > > > >
> > > > >
> > > > > MKL OFF:
> > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
> > > > > cmake_options.yml
> > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > > > > build/libmxnet.so |grep -i omp
> > > > >         libomp.so =>
> > > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > (0x00007fb720c54000)
> > > > >
> > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02 samples/sec
> > > > >  accuracy=1.000000
> > > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46 samples/sec
> > > > >  accuracy=0.999375
> > > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55 samples/sec
> > > > >  accuracy=0.999687
> > > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56 samples/sec
> > > > >  accuracy=0.999844
> > > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > > > INFO:root:Epoch[19] Time cost=1.259
> > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
> > > > > 1147008maxresident)k
> > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > > > >
> > > > > Let me know what you think.
> > > > >
> > > > > Link to the original PR:
> > > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > > > > <ke...@gmail.com> wrote:
> > > > > >
> > > > > > "if you’re linking in two then you’re doing something wrong."
> > > Correct,
> > > > > > that's one thing I believe we've got consensus on.  So let's call
> > > that
> > > > > out
> > > > > > as a bug to be fixed.
> > > > > >
> > > > > > Let's move forward with some reproducible numbers and then
> discuss
> > > the
> > > > > pros
> > > > > > / cons of which particular OMP implementation we should use.
> > > > > >
> > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Chris
> > > > > > >
> > > > > > > I would ask you to have a bit of patience and help us with your
> > > > > > > experience in this matter. Nobody is ignoring anything, I
> think we
> > > are
> > > > > > > individually gathering feedbacks and trying to understand the
> > > multiple
> > > > > > > contributions done to this topic including yours, then go step
> by
> > > > > > > step, understand what is going on and run experiments and
> report
> > > back
> > > > > > > to the list or the corresponding github item. It was suggested
> by
> > > > > > > Kellen to prepare some containers, this takes effort.
> > > > > > >
> > > > > > > Regarding your final comment, most of us also have many other
> > > things
> > > > > > > to do and responsibilities even if our daytime jobs might
> involve
> > > > > > > MXNet in some form or another. I think that's part of the
> privilege
> > > > > > > and responsibility of working close with an open source
> project and
> > > > > > > the magic of collaboration across organizations. Let's all be
> > > patient
> > > > > > > and take some time to understand and reason about this topic
> which
> > > is
> > > > > > > not simple. Since we decided to step back and gather more data
> > > let's
> > > > > > > take time and do it properly.
> > > > > > >
> > > > > > > Personally I hope to find time to look again into this issue
> before
> > > > > > > the end of the week.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> > > cjolivier01@apache.org>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > if you’re linking in two then you’re doing something wrong.
> You
> > > can
> > > > > see
> > > > > > > by
> > > > > > > > my email yesterday that only one is linked in. This is also
> the
> > > case
> > > > > with
> > > > > > > > the mkl version built by the Makefile — only the Intel OMP
> > > library is
> > > > > > > used
> > > > > > > > (no libgomp).
> > > > > > > >
> > > > > > > > That being said, Do you have clear evidence that using Intel
> OMP
> > > is
> > > > > both
> > > > > > > > problematic and the situation isn’t fixable?  The burden of
> > > proof is
> > > > > on
> > > > > > > the
> > > > > > > > ones requesting the change — it is not my responsibility to
> > > justify
> > > > > the
> > > > > > > > current state.  There must be something “terrible” and
> unfixable
> > > to
> > > > > > > justify
> > > > > > > > a change.  I have seen no proof of this in all this time.
> > > > > > > >
> > > > > > > > On a side note, I mentioned a couple of things in my email
> > > yesterday
> > > > > that
> > > > > > > > still are not being responded to (they were also ignored in
> the
> > > last
> > > > > > > > incarnation of this “discussion” — I have much experience in
> this
> > > > > matter
> > > > > > > to
> > > > > > > > assume “discussion” is a waste of my time, seeing and I am
> not
> > > paid
> > > > > to
> > > > > > > > “work on” mxnet like y’all are).
> > > > > > > >
> > > > > > > > -C
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > > > > > > > kellen.sunderland@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I've also quite often seen two versions of OpenMP linked.
> I
> > > think
> > > > > we
> > > > > > > can
> > > > > > > > > all agree we probably want to avoid linking in two
> libraries
> > > that
> > > > > do
> > > > > > > > > effectively the same thing.
> > > > > > > > >
> > > > > > > > > The performance questions should be fairly straight
> forward to
> > > > > > > demonstrate
> > > > > > > > > right?  Could we just collaborate on a few minimal
> Dockerfiles
> > > that
> > > > > > > show
> > > > > > > > > (or don't show) Intel OpenMP performance speedups with the
> > > > > workloads
> > > > > > > Chris
> > > > > > > > > is referencing?
> > > > > > > > >
> > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > > > > > > > stanislav.tsukrov@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Chris!
> > > > > > > > > >
> > > > > > > > > > Stas here - I've gathered that performance data.
> > > > > > > > > > Sure thing, I can be wrong, but please elaborate a bit on
> > > what
> > > > > we are
> > > > > > > > > > missing.
> > > > > > > > > > Be assured, intentional misdirection was never a case.
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for being constructive.
> > > > > > > > > >
> > > > > > > > > > > Turning Intel OMP on and off (and MKL as well, since it
> > > tends
> > > > > to
> > > > > > > pull
> > > > > > > > > in
> > > > > > > > > > omp, depending which one is linked in).
> > > > > > > > > >
> > > > > > > > > > We never ever considered turning MKL off. We are on the
> same
> > > page
> > > > > > > here -
> > > > > > > > > > MKL is crucial for the performance.
> > > > > > > > > > Why should we? There's a GOMP-linked version of MKL,
> that we
> > > can
> > > > > use.
> > > > > > > > > >
> > > > > > > > > > What we did - we measured, if using compilers default
> OpenMP
> > > > > > > > > > implementation instead of referenced source code
> > > distribution of
> > > > > > > OpenMP
> > > > > > > > > > makes anything slower.
> > > > > > > > > > We have found the impact to be hardly measurable.
> > > > > > > > > > The difference between GOMP and iOMP is <5% on our
> > > benchmarks,
> > > > > most
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > time less than that.
> > > > > > > > > >
> > > > > > > > > > We just suggest to simplify the build of mxnet, by
> removing
> > > the
> > > > > > > > > > unnecessary dependency.
> > > > > > > > > >
> > > > > > > > > > During that we discovered for example the following
> amazing
> > > > > issue:
> > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > > > > > > >
> > > > > > > > > > Best Regards
> > > > > > > > > >
> > > > > > > > > > Stas
> > > > > > > > > >
> > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" <
> cjolivier01@gmail.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >     I am very reluctant to feed the trolls again, and
> this
> > > will
> > > > > be
> > > > > > > teh
> > > > > > > > > last
> > > > > > > > > >     time I address Pedro or Anton on the subject, but
> since I
> > > > > think
> > > > > > > the
> > > > > > > > > > numbers
> > > > > > > > > >     being presented are incorrect (either by te builders
> not
> > > > > really
> > > > > > > > > >     understanding what they are building, or possibly
> > > intentional
> > > > > > > > > > misdirection):
> > > > > > > > > >
> > > > > > > > > >     Turning Intel OMP on and off (and MKL as well, since
> it
> > > > > tends to
> > > > > > > pull
> > > > > > > > > > in
> > > > > > > > > >     omp, depending which one is linked in).
> > > > > > > > > >     There is a HUGE difference.  This is consistent with
> my
> > > > > > > experience
> > > > > > > > > > before
> > > > > > > > > >     when it was added.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >     default mnist:
> > > > > > > > > >
> > > > > > > > > >     python ../example/image-classification/train_mnist.py
> > > > > > > > > >     INFO:root:start with arguments
> Namespace(add_stn=False,
> > > > > > > > > batch_size=64,
> > > > > > > > > >     disp_batches=100, dtype='float32', gc_threshold=0.5,
> > > > > > > gc_type='none',
> > > > > > > > > >     gpus=None, image_shape='1, 28, 28',
> > > initializer='default',
> > > > > > > > > >     kv_store='device', load_epoch=None, loss='', lr=0.05,
> > > > > > > lr_factor=0.1,
> > > > > > > > > >     lr_step_epochs='10', macrobatch_size=0,
> > > model_prefix=None,
> > > > > > > mom=0.9,
> > > > > > > > > >     monitor=0, network='mlp', num_classes=10,
> num_epochs=20,
> > > > > > > > > >     num_examples=60000, num_layers=None, optimizer='sgd',
> > > > > > > > > >     profile_server_suffix='', profile_worker_suffix='',
> > > > > > > save_period=1,
> > > > > > > > > >     test_io=0, top_k=0, warmup_epochs=5,
> > > > > warmup_strategy='linear',
> > > > > > > > > > wd=0.0001)
> > > > > > > > > >
> > > > > > > > > >     INTEL OMP:
> > > > > > > > > >
> > > > > > > > > >     ldd libmxnet.so | grep omp
> > > > > > > > > >             libomp.so =>
> > > > > > > > > > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > > > >     (0x00007f978fde7000)
> > > > > > > > > >
> > > > > > > > > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09
> > > > > samples/sec
> > > > > > > > > >     accuracy=0.780012
> > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
> 16073.21
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.920469
> > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
> 19075.91
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.928281
> > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
> 23211.36
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.942813
> > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
> 22139.79
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.938750
> > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
> 23225.52
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.946562
> > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
> 19547.41
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.953281
> > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
> 24111.73
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.951562
> > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
> 13959.88
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.957500
> > > > > > > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > > > > > > >     INFO:root:Epoch[0] Time cost=3.806
> > > > > > > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > > > > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed:
> 24560.21
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.968131
> > > > > > > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed:
> 23457.03
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.966250
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >     LIBGOMP:
> > > > > > > > > >
> > > > > > > > > >     ldd libmxnet.so | grep omp
> > > > > > > > > >             libgomp.so.1 =>
> > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > > > > > >     (0x00007f25c25dd000)
> > > > > > > > > >
> > > > > > > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed:
> 1731.01
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.782488
> > > > > > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed:
> 3551.32
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.907813
> > > > > > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed:
> 1991.00
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.927188
> > > > > > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed:
> 2175.45
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.937969
> > > > > > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed:
> 1644.95
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.942187
> > > > > > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed:
> 6444.58
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.950156
> > > > > > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed:
> 7842.16
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.947969
> > > > > > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed:
> 9412.07
> > > > > > > samples/sec
> > > > > > > > > >      accuracy=0.953750
> > > > > > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed:
> 12707.58
> > > > > > > samples/sec
> > > > > > > > > >     accuracy=0.953125
> > > > > > > > > >
> > > > > > > > > >     That being said, there's other issued beyond speed.
> The
> > > > > DEFAULT
> > > > > > > > > build
> > > > > > > > > > from
> > > > > > > > > >     makefile (not CMake) uses Intel OMP mkl (I showed
> > > before) and
> > > > > > > > > > mysteriously
> > > > > > > > > >     it has no issues?  This seems highly suspicious.
> All I
> > > see
> > > > > is a
> > > > > > > lot
> > > > > > > > > of
> > > > > > > > > >     hand-waving and conjecture and pointing to
> StackOverflow
> > > > > posts
> > > > > > > made
> > > > > > > > > by
> > > > > > > > > >     people who may be of questionable pedigree to begin
> with.
> > > > > This
> > > > > > > > > smells
> > > > > > > > > > of a
> > > > > > > > > >     Pedro-ego-fight rather than one of purely technical
> > > merit.
> > > > > > > Also, if
> > > > > > > > > > one
> > > > > > > > > >     knows how OMP works,  they would be very suspicious
> of
> > > the
> > > > > > > > > > "intermittent
> > > > > > > > > >     hangs" claim -- that's probably just broken race
> > > conditions
> > > > > > > elsewhere
> > > > > > > > > > until
> > > > > > > > > >     proven differently.  It'd tend freeze on the first
> use if
> > > > > > > something
> > > > > > > > > is
> > > > > > > > > >     wrong (try using libgomp after a fork and see), since
> > > worker
> > > > > > > threads"
> > > > > > > > > >     wouldn't be assigned/joined properly.  IntelOMP is
> > > faster,
> > > > > but
> > > > > > > also
> > > > > > > > > has
> > > > > > > > > >     other advantages, such as allowing OMP after a fork.
> > > > > > > > > >
> > > > > > > > > >     I actually addressed a lot of issues and ask for
> > > > > clarification
> > > > > > > in the
> > > > > > > > > >     original PR's way back when, but they're all just
> > > ignored.
> > > > > > > > > >
> > > > > > > > > >     -Chris
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
There's an assertion that is easily reproducible, and there's also a
crash with a core dump; the latter is not easy for me to reproduce
across different environments. I have also seen MXNet get stuck with
this build configuration when running unit tests, making no progress
and using no CPU at all.

In my view, the root cause of the assertion is that we are re-entering
OMP initialization when spawning threads in the following code, through
pthread_atfork:

https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58

This causes double initialization of the OMP engine, which triggers the
assertion you are asking about, and I suspect it also adds some
overhead. That's the shady forking part you are asking about.
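
A reduced sketch of the pattern described above (hypothetical names, not the
actual src/initialize.cc code): a pthread_atfork() child handler that
restarts worker threads and therefore re-enters OpenMP initialization inside
the forked process.

// atfork_reinit.cc -- hedged sketch with hypothetical names, not the real
// MXNet initialize.cc. A child fork handler restarts threading and so
// re-initializes the OpenMP runtime; a debug build of the runtime is where
// an assertion about re-initialization would be expected to fire.
#include <omp.h>
#include <pthread.h>
#include <unistd.h>
#include <cstdio>

namespace {

void AtForkChild() {
  // Hypothetical "engine restart" in the child: the first OpenMP region
  // here forces the runtime to initialize again in the forked process.
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      std::printf("child restarted with %d OMP threads\n",
                  omp_get_num_threads());
  }
}

// Register the handler once at load time, roughly the registration pattern
// referenced above.
const int g_fork_handlers_installed =
    pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr, AtForkChild);

}  // namespace

int main() {
  // Create the thread pool in the parent first, then fork so that the
  // child handler above runs and re-enters the OpenMP runtime.
  #pragma omp parallel
  { }
  if (fork() == 0) _exit(0);
  return 0;
}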

A question for you: What is the cause of runtime differences between
OMP runtimes? Shouldn't the implementation overhead diminish as
threads run longer?

Pedro.

On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cj...@gmail.com> wrote:
>
> What’s the reason for the assertion failure? btw classifying an assertion
> failure a “crash” is debatable. As I stated in the original issue a long
> time ago, it’s possible something shady is being done with when forking
> that should be fixed.  The assertion should be root caused.
>
>
>
> On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > Added a dockerfile, and reports of a crash in my local machine when
> > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
> > I couldn't reproduce the crash on my EC2 machine:
> > Added the backtrace of the crash as well.
> >
> > https://github.com/apache/incubator-mxnet/issues/10856
> >
> > Dockerfile here:
> >
> > https://github.com/larroy/mxnet_omp
> >
> > Kind regards.
> >
> > Pedro.
> >
> > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <ma...@gmail.com>
> > wrote:
> > >
> > > As already proposed, I think the easiest way to get a common
> > understanding
> > > is if we start with a few docker containers. Pedro, would it be possible
> > > for you to wrap your benchmarks into a few containers that will produce
> > > your shown results? That way, we can avoid possible misunderstandings and
> > > also pinpoint the exact parts where people disagree or misunderstood each
> > > other.
> > >
> > > -Marco
> > >
> > > Pedro Larroy <pe...@gmail.com> schrieb am Do., 20. Juni
> > 2019,
> > > 21:47:
> > >
> > > > I can confirm that we are linking with two versions of omp, I'm
> > > > gaining more clarity into this topic, but I have still questions, the
> > > > facts that I got so far are the folllowing:
> > > >
> > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > openmp when building with MKL enabled.
> > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > one is used on the PR proposed by Anton).
> > > >
> > > > Questions:
> > > >
> > > >  * #1 Is it ok to have two versions of openmp linked at the same time?
> > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > total training time of my measurement for a partial answer)
> > > >  * #3 Should we have a build flag so we can choose the OMP version at
> > > > runtime?
> > > >  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
> > > >  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> > > > so could you provide a link?  I think we would need to reproduce your
> > > > benchmarks and verify which versions are being linked. It's possible
> > > > that while compiling with MKL intel's omp was pulled in instead of
> > > > GNU OpenMP.
> > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > > update the subrepo regularly?
> > > >
> > > > My conclusion so far:
> > > >
> > > >  * #1 We should avoid linking two versions of omp if possible and
> > > > allow users to choose one in the build as we do for BLAS.
> > > >  * #2 For performance reasons and more control vs different compiler
> > > > versions seems it makes indeed sense to keep the LLVM OpenMP version
> > > > in 3rdparty for now. So unless some more data is gathered, it makes
> > > > sense not to remove it as of now.
> > > >  * #3 We should provide build options to choose which openmp library
> > > > is to be used from the three options available, including libgomp.
> > > >  * #4 Refining the build we could also enable OpenMP in mac without
> > > > additional contortions (doesn't work as of today):
> > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > >  * #5 We should add different omp versions to our benchmarks and track
> > > > the performance, so this data is available for prescribing the best
> > > > build options and for binary releases.
> > > >
> > > > This is also an interesting related gh issue posted in the mkl-dnn
> > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > > >
> > > >
> > > > I don't observe the order of magnitude divergence reported by Chris in
> > > > vanilla Ubuntu 18.04 in samples / s but the full training finishes
> > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp.
> > > >
> > > > There's also differences in training time when using MKL and the ,
> > > > it's actually a bit slower, I don't know if it's related to OMP.
> > > >
> > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > > >
> > > > Anton's branch:  git@github.com:lebeg/incubator-mxnet.git   branch
> > 'omp'
> > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > > build/libmxnet.so |grep -i omp
> > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > (0x00007fd99a51d000)
> > > >
> > > > time python train_mnist.py
> > > >
> > > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00 samples/sec
> > > >  accuracy=1.000000
> > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69 samples/sec
> > > >  accuracy=0.999531
> > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53 samples/sec
> > > >  accuracy=1.000000
> > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63 samples/sec
> > > >  accuracy=0.999375
> > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98 samples/sec
> > > >  accuracy=0.999531
> > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > INFO:root:Epoch[19] Time cost=1.219
> > > > INFO:root:Epoch[19] Validation-accuracy=0.983977
> > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata
> > > > 1146052maxresident)k
> > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
> > > >
> > > > Master, MKL ON:
> > > >
> > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd
> > > > ../../build/libmxnet.so | grep -i omp
> > > >         libomp.so =>
> > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > (0x00007f05ba38f000)
> > > >         libiomp5.so =>
> > > > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > (0x00007f05b09f4000)
> > > >
> > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63 samples/sec
> > > >  accuracy=0.999691
> > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96 samples/sec
> > > >  accuracy=0.999531
> > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47 samples/sec
> > > >  accuracy=0.999375
> > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > INFO:root:Epoch[19] Time cost=1.367
> > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
> > > > 1154348maxresident)k
> > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > > >
> > > >
> > > > MKL OFF:
> > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
> > > > cmake_options.yml
> > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > > > build/libmxnet.so |grep -i omp
> > > >         libomp.so =>
> > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > (0x00007fb720c54000)
> > > >
> > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02 samples/sec
> > > >  accuracy=1.000000
> > > > INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46 samples/sec
> > > >  accuracy=0.999375
> > > > INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55 samples/sec
> > > >  accuracy=0.999687
> > > > INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56 samples/sec
> > > >  accuracy=0.999844
> > > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > > INFO:root:Epoch[19] Time cost=1.259
> > > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
> > > > 1147008maxresident)k
> > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > > >
> > > > Let me know what you think.
> > > >
> > > > Link to the original PR:
> > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > >
> > > > Thanks.
> > > >

Re: OMP

Posted by Chris Olivier <cj...@gmail.com>.
What’s the reason for the assertion failure? By the way, classifying an
assertion failure as a “crash” is debatable. As I stated in the original
issue a long time ago, it’s possible something shady is being done when
forking that should be fixed. The assertion should be root-caused.
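
For anyone who wants to poke at this, here is a minimal sketch of the pattern
in question (my own illustration, not code from the issue; assumes Linux and
an MXNet build importable as mxnet): warm the runtime up in the parent, fork,
then run an operator in the child. Only the forking thread survives in the
child, so what happens next depends on which OpenMP runtime libmxnet.so is
linked against.

    # Hypothetical repro sketch: fork after the OpenMP runtime has already
    # spun up worker threads, then do more work in the child process.
    import multiprocessing as mp
    import mxnet as mx

    def child_work():
        a = mx.nd.ones((512, 512))
        print("child ok:", mx.nd.dot(a, a).sum().asscalar())

    if __name__ == "__main__":
        a = mx.nd.ones((512, 512))
        print("parent ok:", mx.nd.dot(a, a).sum().asscalar())  # warms up OMP threads
        p = mp.get_context("fork").Process(target=child_work)  # explicit fork start method
        p.start()
        p.join(timeout=60)
        if p.is_alive():
            print("child still running after 60s -- possible hang")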



On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Added a Dockerfile, and a report of a crash on my local machine when
> running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
> I couldn't reproduce the crash on my EC2 machine.
> I added the backtrace of the crash as well.
>
> https://github.com/apache/incubator-mxnet/issues/10856
>
> Dockerfile here:
>
> https://github.com/larroy/mxnet_omp
>
> Kind regards.
>
> Pedro.
>

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
Added a Dockerfile, and a report of a crash on my local machine when
running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
I couldn't reproduce the crash on my EC2 machine.
I added the backtrace of the crash as well.

https://github.com/apache/incubator-mxnet/issues/10856

Dockerfile here:

https://github.com/larroy/mxnet_omp
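
While reproducing, it may also be worth confirming inside the container which
OpenMP runtimes actually end up loaded at run time, since ldd only shows
link-time dependencies, not necessarily everything a given run maps. A small
sketch of one way to check (my own helper, not part of the repro repo;
assumes Linux and an importable mxnet build):

    # Hypothetical helper: list the OpenMP shared objects mapped into the
    # running process, to confirm whether one or two implementations
    # (libomp / libiomp5 / libgomp) were loaded.
    import mxnet as mx

    def loaded_omp_libs(maps_path="/proc/self/maps"):
        libs = set()
        with open(maps_path) as f:
            for line in f:
                if any(n in line for n in ("libomp", "libiomp5", "libgomp")):
                    libs.add(line.split()[-1])
        return sorted(libs)

    if __name__ == "__main__":
        mx.nd.ones((2, 2)).asnumpy()   # force lazily-loaded dependencies to resolve
        print("\n".join(loaded_omp_libs()))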

Kind regards.

Pedro.

On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <ma...@gmail.com> wrote:
>
> As already proposed, I think the easiest way to reach a common understanding
> is to start with a few Docker containers. Pedro, would it be possible
> for you to wrap your benchmarks into a few containers that reproduce the
> results you have shown? That way, we can avoid possible misunderstandings and
> also pinpoint the exact points where people disagree or misunderstand each
> other.
>
> -Marco
>

Re: OMP

Posted by Marco de Abreu <ma...@gmail.com>.
As already proposed, I think the easiest way to reach a common understanding
is to start with a few Docker containers. Pedro, would it be possible
for you to wrap your benchmarks into a few containers that reproduce the
results you have shown? That way, we can avoid possible misunderstandings and
also pinpoint the exact points where people disagree or misunderstand each
other.

-Marco
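
To keep the containers comparable, each image could run the same tiny,
self-contained measurement. A rough sketch of such a script (hypothetical,
not the actual benchmark from this thread; assumes an importable mxnet build)
that just reports a samples/sec figure for a fixed CPU workload:

    # Hypothetical micro-benchmark: time a fixed number of CPU matrix products
    # and report samples/sec, so numbers from differently-built images can be
    # compared like-for-like.
    import time
    import mxnet as mx

    def bench(batch=64, dim=1024, iters=200):
        x = mx.nd.random.uniform(shape=(batch, dim))
        w = mx.nd.random.uniform(shape=(dim, dim))
        mx.nd.waitall()                      # finish setup before timing
        start = time.time()
        for _ in range(iters):
            mx.nd.dot(x, w)
        mx.nd.waitall()                      # drain MXNet's async engine
        return batch * iters / (time.time() - start)

    if __name__ == "__main__":
        print("samples/sec: %.2f" % bench())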

Pedro Larroy <pe...@gmail.com> wrote on Thu, 20 Jun 2019, 21:47:

> I can confirm that we are linking with two versions of omp, I'm
> gaining more clarity into this topic, but I still have questions. The
> facts that I got so far are the following:
>
> * #1: We are linking with two versions of omp, intel's omp and llvm
> openmp when building with MKL enabled.
> * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> one is used on the PR proposed by Anton).
>
> Questions:
>
>  * #1 Is it ok to have two versions of openmp linked at the same time?
>  * #2 Which implementation of OMP gives the best performance?  (See
> total training time of my measurement for a partial answer)
>  * #3 Should we have a build flag so we can choose the OMP version at
> runtime?
>  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
>  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> so could you provide a link?  I think we would need to reproduce your
> benchmarks and verify which versions are being linked. It's possible
> that while compiling with MKL intel's omp was pulled in instead of
> GNU OpenMP.
>  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> update the subrepo regularly?
>
> My conclusion so far:
>
>  * #1 We should avoid linking two versions of omp if possible and
> allow users to choose one in the build as we do for BLAS.
>  * #2 For performance reasons and more control vs different compiler
> versions seems it makes indeed sense to keep the LLVM OpenMP version
> in 3rdparty for now. So unless some more data is gathered, it makes
> sense not to remove it as of now.
>  * #3 We should provide build options to choose which openmp library
> is to be used from the three options available, including libgomp.
>  * #4 Refining the build we could also enable OpenMP in mac without
> additional contortions (doesn't work as of today):
> https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
>  * #5 We should add different omp versions to our benchmarks and track
> the performance, so this data is available for prescribing the best
> build options and for binary releases.
>
> This is also an interesting related gh issue posted in the mkl-dnn
> repository:  https://github.com/intel/mkl-dnn/issues/230
>
>
> I don't observe the order-of-magnitude divergence in samples/sec reported
> by Chris on vanilla Ubuntu 18.04, but the full training does finish
> faster with the OMP from 3rdparty (LLVM OpenMP) than with libgomp.
>
> There are also differences in training time when using MKL; it's
> actually a bit slower, and I don't know if it's related to OMP.
>
> gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
>
> Anton's branch:  git@github.com:lebeg/incubator-mxnet.git   branch 'omp'
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> build/libmxnet.so |grep -i omp
>         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x00007fd99a51d000)
>
> time python train_mnist.py
>
> INFO:root:Epoch[18] Validation-accuracy=0.984176
> INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00 samples/sec
>  accuracy=1.000000
> INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69 samples/sec
>  accuracy=0.999531
> INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53 samples/sec
>  accuracy=1.000000
> INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63 samples/sec
>  accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98 samples/sec
>  accuracy=0.999531
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.219
> INFO:root:Epoch[19] Validation-accuracy=0.983977
> 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata
> 1146052maxresident)k
> 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
>
> Master, MKL ON:
>
> (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd
> ../../build/libmxnet.so | grep -i omp
>         libomp.so =>
> /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> (0x00007f05ba38f000)
>         libiomp5.so =>
>
> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> (0x00007f05b09f4000)
>
> INFO:root:Epoch[18] Validation-accuracy=0.982484
> INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63 samples/sec
>  accuracy=0.999691
> INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96 samples/sec
>  accuracy=0.999531
> INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47 samples/sec
>  accuracy=0.999375
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.367
> INFO:root:Epoch[19] Validation-accuracy=0.982783
> 854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
> 1154348maxresident)k
> 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
>
>
> MKL OFF:
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
> cmake_options.yml
> USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> build/libmxnet.so |grep -i omp
>         libomp.so =>
> /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> (0x00007fb720c54000)
>
> INFO:root:Epoch[18] Validation-accuracy=0.983479
> INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02 samples/sec
>  accuracy=1.000000
> INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46 samples/sec
>  accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55 samples/sec
>  accuracy=0.999687
> INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56 samples/sec
>  accuracy=0.999844
> INFO:root:Epoch[19] Train-accuracy=0.999767
> INFO:root:Epoch[19] Time cost=1.259
> INFO:root:Epoch[19] Validation-accuracy=0.983181
> 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
> 1147008maxresident)k
> 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
>
> Let me know what you think.
>
> Link to the original PR:
> https://github.com/apache/incubator-mxnet/pull/12160
>
> Thanks.
>
> On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> <ke...@gmail.com> wrote:
> >
> > "if you’re linking in two then you’re doing something wrong." Correct,
> > that's one thing I believe we've got consensus on.  So let's call that
> out
> > as a bug to be fixed.
> >
> > Let's move forward with some reproducible numbers and then discuss the
> pros
> > / cons of which particular OMP implementation we should use.
> >
> > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > Hi Chris
> > >
> > > I would ask you to have a bit of patience and help us with your
> > > experience in this matter. Nobody is ignoring anything, I think we are
> > > individually gathering feedbacks and trying to understand the multiple
> > > contributions done to this topic including yours, then go step by
> > > step, understand what is going on and run experiments and report back
> > > to the list or the corresponding github item. It was suggested by
> > > Kellen to prepare some containers, this takes effort.
> > >
> > > Regarding your final comment, most of us also have many other things
> > > to do and responsibilities even if our daytime jobs might involve
> > > MXNet in some form or another. I think that's part of the privilege
> > > and responsibility of working close with an open source project and
> > > the magic of collaboration across organizations. Let's all be patient
> > > and take some time to understand and reason about this topic which is
> > > not simple. Since we decided to step back and gather more data let's
> > > take time and do it properly.
> > >
> > > Personally I hope to find time to look again into this issue before
> > > the end of the week.
> > >
> > > Thanks.
> > >
> > > Pedro.
> > >
> > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <cj...@apache.org>
> > > wrote:
> > > >
> > > > if you’re linking in two then you’re doing something wrong. You can
> see
> > > by
> > > > my email yesterday that only one is linked in. This is also the case
> with
> > > > the mkl version built by the Makefile — only the Intel OMP library is
> > > used
> > > > (no libgomp).
> > > >
> > > > That being said, Do you have clear evidence that using Intel OMP is
> both
> > > > problematic and the situation isn’t fixable?  The burden of proof is
> on
> > > the
> > > > ones requesting the change — it is not my responsibility to justify
> the
> > > > current state.  There must be something “terrible” and unfixable to
> > > justify
> > > > a change.  I have seen no proof of this in all this time.
> > > >
> > > > On a side note, I mentioned a couple of things in my email yesterday
> that
> > > > still are not being responded to (they were also ignored in the last
> > > > incarnation of this “discussion” — I have much experience in this
> matter
> > > to
> > > > assume “discussion” is a waste of my time, seeing and I am not paid
> to
> > > > “work on” mxnet like y’all are).
> > > >
> > > > -C
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > > > kellen.sunderland@gmail.com> wrote:
> > > >
> > > > > I've also quite often seen two versions of OpenMP linked.  I think
> we
> > > can
> > > > > all agree we probably want to avoid linking in two libraries that
> do
> > > > > effectively the same thing.
> > > > >
> > > > > The performance questions should be fairly straight forward to
> > > demonstrate
> > > > > right?  Could we just collaborate on a few minimal Dockerfiles that
> > > show
> > > > > (or don't show) Intel OpenMP performance speedups with the
> workloads
> > > Chris
> > > > > is referencing?
> > > > >
> > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > > > stanislav.tsukrov@gmail.com> wrote:
> > > > >
> > > > > > Hi, Chris!
> > > > > >
> > > > > > Stas here - I've gathered that performance data.
> > > > > > Sure thing, I can be wrong, but please elaborate a bit on what
> we are
> > > > > > missing.
> > > > > > Be assured, intentional misdirection was never a case.
> > > > > >
> > > > > > Thanks a lot for being constructive.
> > > > > >
> > > > > > > Turning Intel OMP on and off (and MKL as well, since it tends
> to
> > > pull
> > > > > in
> > > > > > omp, depending which one is linked in).
> > > > > >
> > > > > > We never ever considered turning MKL off. We are on the same page
> > > here -
> > > > > > MKL is crucial for the performance.
> > > > > > Why should we? There's a GOMP-linked version of MKL, that we can
> use.
> > > > > >
> > > > > > What we did - we measured, if using compilers default OpenMP
> > > > > > implementation instead of referenced source code distribution of
> > > OpenMP
> > > > > > makes anything slower.
> > > > > > We have found the impact to be hardly measurable.
> > > > > > The difference between GOMP and iOMP is <5% on our benchmarks,
> most
> > > of
> > > > > the
> > > > > > time less than that.
> > > > > >
> > > > > > We just suggest to simplify the build of mxnet, by removing the
> > > > > > unnecessary dependency.
> > > > > >
> > > > > > During that we discovered for example the following amazing
> issue:
> > > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > > Stas
> > > > > >
> > > > > > On 18.06.19, 18:24, "Chris Olivier" <cj...@gmail.com>
> wrote:
> > > > > >
> > > > > >     I am very reluctant to feed the trolls again, and this will
> be
> > > teh
> > > > > last
> > > > > >     time I address Pedro or Anton on the subject, but since I
> think
> > > the
> > > > > > numbers
> > > > > >     being presented are incorrect (either by te builders not
> really
> > > > > >     understanding what they are building, or possibly intentional
> > > > > > misdirection):
> > > > > >
> > > > > >     Turning Intel OMP on and off (and MKL as well, since it
> tends to
> > > pull
> > > > > > in
> > > > > >     omp, depending which one is linked in).
> > > > > >     There is a HUGE difference.  This is consistent with my
> > > experience
> > > > > > before
> > > > > >     when it was added.
> > > > > >
> > > > > >
> > > > > >     default mnist:
> > > > > >
> > > > > >     python ../example/image-classification/train_mnist.py
> > > > > >     INFO:root:start with arguments Namespace(add_stn=False,
> > > > > batch_size=64,
> > > > > >     disp_batches=100, dtype='float32', gc_threshold=0.5,
> > > gc_type='none',
> > > > > >     gpus=None, image_shape='1, 28, 28', initializer='default',
> > > > > >     kv_store='device', load_epoch=None, loss='', lr=0.05,
> > > lr_factor=0.1,
> > > > > >     lr_step_epochs='10', macrobatch_size=0, model_prefix=None,
> > > mom=0.9,
> > > > > >     monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > > > > >     num_examples=60000, num_layers=None, optimizer='sgd',
> > > > > >     profile_server_suffix='', profile_worker_suffix='',
> > > save_period=1,
> > > > > >     test_io=0, top_k=0, warmup_epochs=5,
> warmup_strategy='linear',
> > > > > > wd=0.0001)
> > > > > >
> > > > > >     INTEL OMP:
> > > > > >
> > > > > >     ldd libmxnet.so | grep omp
> > > > > >             libomp.so =>
> > > > > >
> > >  /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > > > > >     (0x00007f978fde7000)
> > > > > >
> > > > > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09
> samples/sec
> > > > > >     accuracy=0.780012
> > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21
> > > samples/sec
> > > > > >     accuracy=0.920469
> > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91
> > > samples/sec
> > > > > >     accuracy=0.928281
> > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36
> > > samples/sec
> > > > > >     accuracy=0.942813
> > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79
> > > samples/sec
> > > > > >     accuracy=0.938750
> > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52
> > > samples/sec
> > > > > >     accuracy=0.946562
> > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41
> > > samples/sec
> > > > > >     accuracy=0.953281
> > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73
> > > samples/sec
> > > > > >     accuracy=0.951562
> > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88
> > > samples/sec
> > > > > >     accuracy=0.957500
> > > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > > >     INFO:root:Epoch[0] Time cost=3.806
> > > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21
> > > samples/sec
> > > > > >     accuracy=0.968131
> > > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03
> > > samples/sec
> > > > > >     accuracy=0.966250
> > > > > >
> > > > > >
> > > > > >     LIBGOMP:
> > > > > >
> > > > > >     ldd libmxnet.so | grep omp
> > > > > >             libgomp.so.1 =>
> /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > >     (0x00007f25c25dd000)
> > > > > >
> > > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01
> > > samples/sec
> > > > > >      accuracy=0.782488
> > > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32
> > > samples/sec
> > > > > >      accuracy=0.907813
> > > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00
> > > samples/sec
> > > > > >      accuracy=0.927188
> > > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45
> > > samples/sec
> > > > > >      accuracy=0.937969
> > > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95
> > > samples/sec
> > > > > >      accuracy=0.942187
> > > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58
> > > samples/sec
> > > > > >      accuracy=0.950156
> > > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16
> > > samples/sec
> > > > > >      accuracy=0.947969
> > > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07
> > > samples/sec
> > > > > >      accuracy=0.953750
> > > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58
> > > samples/sec
> > > > > >     accuracy=0.953125
> > > > > >
> > > > > >     That being said, there's other issued beyond speed.  The
> DEFAULT
> > > > > build
> > > > > > from
> > > > > >     makefile (not CMake) uses Intel OMP mkl (I showed before) and
> > > > > > mysteriously
> > > > > >     it has no issues?  This seems highly suspicious.  All I see
> is a
> > > lot
> > > > > of
> > > > > >     hand-waving and conjecture and pointing to StackOverflow
> posts
> > > made
> > > > > by
> > > > > >     people who may be of questionable pedigree to begin with.
> This
> > > > > smells
> > > > > > of a
> > > > > >     Pedro-ego-fight rather than one of purely technical merit.
> > > Also, if
> > > > > > one
> > > > > >     knows how OMP works,  they would be very suspicious of the
> > > > > > "intermittent
> > > > > >     hangs" claim -- that's probably just broken race conditions
> > > elsewhere
> > > > > > until
> > > > > >     proven differently.  It'd tend freeze on the first use if
> > > something
> > > > > is
> > > > > >     wrong (try using libgomp after a fork and see), since worker
> > > threads"
> > > > > >     wouldn't be assigned/joined properly.  IntelOMP is faster,
> but
> > > also
> > > > > has
> > > > > >     other advantages, such as allowing OMP after a fork.
> > > > > >
> > > > > >     I actually addressed a lot of issues and ask for
> clarification
> > > in the
> > > > > >     original PR's way back when, but they're all just ignored.
> > > > > >
> > > > > >     -Chris
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > >
>

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
I can confirm that we are linking with two versions of omp. I'm
gaining more clarity into this topic, but I still have questions. The
facts that I have gathered so far are the following:

* #1: We are linking with two versions of omp, intel's omp and llvm
openmp when building with MKL enabled.
* #2: We have 3 different possible OMP versions: Intel OMP (comes with
MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
one is used on the PR proposed by Anton).

Questions:

 * #1 Is it ok to have two versions of openmp linked at the same time?
 * #2 Which implementation of OMP gives the best performance?  (See
total training time of my measurement for a partial answer)
 * #3 Should we have a build flag so we can choose the OMP version at runtime?
 * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
 * #5 @Stas: is there a script to replicate your benchmarks easily? If
so could you provide a link?  I think we would need to reproduce your
benchmarks and verify which versions are being linked. It's possible
that while compiling with MKL intel's omp was pulled in instead of
GNU OpenMP.
 * #6 @Chris: how do we maintain the copy of LLVM's OpenMP? Should we
update the subrepo regularly?

My conclusion so far:

 * #1 We should avoid linking two versions of omp if possible and
allow users to choose one in the build as we do for BLAS (a quick
runtime check for double-linking is sketched after this list).
 * #2 For performance reasons, and for more control across different
compiler versions, it seems to make sense to keep the LLVM OpenMP version
in 3rdparty for now. Unless more data is gathered, it makes sense not to
remove it for now.
 * #3 We should provide build options to choose which openmp library
is to be used from the three options available, including libgomp.
 * #4 Refining the build we could also enable OpenMP in mac without
additional contortions (doesn't work as of today):
https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
 * #5 We should add different omp versions to our benchmarks and track
the performance, so this data is available for prescribing the best
build options and for binary releases.
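
A quick way to verify conclusion #1 at runtime is to look at which OpenMP
runtimes actually end up mapped into the process. A minimal sketch
(Linux-only, assuming the mxnet build under test is importable from the
current Python environment):

    # Check which OpenMP runtimes are loaded after exercising mxnet a bit.
    import mxnet as mx

    # Trigger some computation so lazily loaded dependencies get mapped in.
    mx.nd.ones((1000, 1000)).sum().wait_to_read()

    seen = set()
    with open("/proc/self/maps") as maps:       # Linux-only
        for line in maps:
            path = line.split()[-1]
            if any(name in path for name in ("libgomp", "libomp", "libiomp5")):
                seen.add(path)
    print("\n".join(sorted(seen)) or "no OpenMP runtime mapped")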

This is also an interesting related gh issue posted in the mkl-dnn
repository:  https://github.com/intel/mkl-dnn/issues/230


I don't observe the order-of-magnitude divergence in samples/sec reported
by Chris on vanilla Ubuntu 18.04, but the full training does finish
faster with the OMP from 3rdparty (LLVM OpenMP) than with libgomp.

There are also differences in training time when using MKL; it's
actually a bit slower, and I don't know if it's related to OMP.

gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

Anton's branch:  git@github.com:lebeg/incubator-mxnet.git   branch 'omp'
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
build/libmxnet.so |grep -i omp
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x00007fd99a51d000)

time python train_mnist.py

INFO:root:Epoch[18] Validation-accuracy=0.984176
INFO:root:Epoch[19] Batch [0-100]       Speed: 41617.00 samples/sec
 accuracy=1.000000
INFO:root:Epoch[19] Batch [100-200]     Speed: 47990.69 samples/sec
 accuracy=0.999531
INFO:root:Epoch[19] Batch [200-300]     Speed: 47517.01 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400]     Speed: 47430.53 samples/sec
 accuracy=1.000000
INFO:root:Epoch[19] Batch [400-500]     Speed: 47649.77 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [500-600]     Speed: 51708.12 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [600-700]     Speed: 57228.63 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800]     Speed: 50887.85 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900]     Speed: 53947.98 samples/sec
 accuracy=0.999531
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.219
INFO:root:Epoch[19] Validation-accuracy=0.983977
1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata
1146052maxresident)k
0inputs+0outputs (0major+3496364minor)pagefaults 0swaps

Master, MKL ON:

(py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd
../../build/libmxnet.so | grep -i omp
        libomp.so =>
/home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
(0x00007f05ba38f000)
        libiomp5.so =>
/home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
(0x00007f05b09f4000)

INFO:root:Epoch[18] Validation-accuracy=0.982484
INFO:root:Epoch[19] Batch [0-100]       Speed: 36651.63 samples/sec
 accuracy=0.999691
INFO:root:Epoch[19] Batch [100-200]     Speed: 45093.98 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [200-300]     Speed: 45146.84 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400]     Speed: 45119.90 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [400-500]     Speed: 44998.96 samples/sec
 accuracy=0.999531
INFO:root:Epoch[19] Batch [500-600]     Speed: 45072.25 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700]     Speed: 44969.79 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [700-800]     Speed: 44962.78 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900]     Speed: 44945.47 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.367
INFO:root:Epoch[19] Validation-accuracy=0.982783
854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
1154348maxresident)k
0inputs+0outputs (0major+3624361minor)pagefaults 0swaps


MKL OFF:
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
cmake_options.yml
USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
USE_MKL_IF_AVAILABLE AND (NOT APPLE)
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
build/libmxnet.so |grep -i omp
        libomp.so =>
/home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
(0x00007fb720c54000)

INFO:root:Epoch[18] Validation-accuracy=0.983479
INFO:root:Epoch[19] Batch [0-100]       Speed: 46784.02 samples/sec
 accuracy=1.000000
INFO:root:Epoch[19] Batch [100-200]     Speed: 48824.29 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [200-300]     Speed: 49190.31 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400]     Speed: 51518.77 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [400-500]     Speed: 51551.62 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [500-600]     Speed: 49026.35 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700]     Speed: 49002.46 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800]     Speed: 48980.55 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [800-900]     Speed: 47402.56 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Train-accuracy=0.999767
INFO:root:Epoch[19] Time cost=1.259
INFO:root:Epoch[19] Validation-accuracy=0.983181
755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
1147008maxresident)k
0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps

Let me know what you think.

Link to the original PR: https://github.com/apache/incubator-mxnet/pull/12160

Thanks.

On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
<ke...@gmail.com> wrote:
>
> "if you’re linking in two then you’re doing something wrong." Correct,
> that's one thing I believe we've got consensus on.  So let's call that out
> as a bug to be fixed.
>
> Let's move forward with some reproducible numbers and then discuss the pros
> / cons of which particular OMP implementation we should use.
>
> On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > Hi Chris
> >
> > I would ask you to have a bit of patience and help us with your
> > experience in this matter. Nobody is ignoring anything, I think we are
> > individually gathering feedbacks and trying to understand the multiple
> > contributions done to this topic including yours, then go step by
> > step, understand what is going on and run experiments and report back
> > to the list or the corresponding github item. It was suggested by
> > Kellen to prepare some containers, this takes effort.
> >
> > Regarding your final comment, most of us also have many other things
> > to do and responsibilities even if our daytime jobs might involve
> > MXNet in some form or another. I think that's part of the privilege
> > and responsibility of working close with an open source project and
> > the magic of collaboration across organizations. Let's all be patient
> > and take some time to understand and reason about this topic which is
> > not simple. Since we decided to step back and gather more data let's
> > take time and do it properly.
> >
> > Personally I hope to find time to look again into this issue before
> > the end of the week.
> >
> > Thanks.
> >
> > Pedro.
> >
> > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <cj...@apache.org>
> > wrote:
> > >
> > > if you’re linking in two then you’re doing something wrong. You can see
> > by
> > > my email yesterday that only one is linked in. This is also the case with
> > > the mkl version built by the Makefile — only the Intel OMP library is
> > used
> > > (no libgomp).
> > >
> > > That being said, Do you have clear evidence that using Intel OMP is both
> > > problematic and the situation isn’t fixable?  The burden of proof is on
> > the
> > > ones requesting the change — it is not my responsibility to justify the
> > > current state.  There must be something “terrible” and unfixable to
> > justify
> > > a change.  I have seen no proof of this in all this time.
> > >
> > > On a side note, I mentioned a couple of things in my email yesterday that
> > > still are not being responded to (they were also ignored in the last
> > > incarnation of this “discussion” — I have much experience in this matter
> > to
> > > assume “discussion” is a waste of my time, seeing and I am not paid to
> > > “work on” mxnet like y’all are).
> > >
> > > -C
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > > kellen.sunderland@gmail.com> wrote:
> > >
> > > > I've also quite often seen two versions of OpenMP linked.  I think we
> > can
> > > > all agree we probably want to avoid linking in two libraries that do
> > > > effectively the same thing.
> > > >
> > > > The performance questions should be fairly straight forward to
> > demonstrate
> > > > right?  Could we just collaborate on a few minimal Dockerfiles that
> > show
> > > > (or don't show) Intel OpenMP performance speedups with the workloads
> > Chris
> > > > is referencing?
> > > >
> > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > > stanislav.tsukrov@gmail.com> wrote:
> > > >
> > > > > Hi, Chris!
> > > > >
> > > > > Stas here - I've gathered that performance data.
> > > > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > > > missing.
> > > > > Be assured, intentional misdirection was never a case.
> > > > >
> > > > > Thanks a lot for being constructive.
> > > > >
> > > > > > Turning Intel OMP on and off (and MKL as well, since it tends to
> > pull
> > > > in
> > > > > omp, depending which one is linked in).
> > > > >
> > > > > We never ever considered turning MKL off. We are on the same page
> > here -
> > > > > MKL is crucial for the performance.
> > > > > Why should we? There's a GOMP-linked version of MKL, that we can use.
> > > > >
> > > > > What we did - we measured, if using compilers default OpenMP
> > > > > implementation instead of referenced source code distribution of
> > OpenMP
> > > > > makes anything slower.
> > > > > We have found the impact to be hardly measurable.
> > > > > The difference between GOMP and iOMP is <5% on our benchmarks, most
> > of
> > > > the
> > > > > time less than that.
> > > > >
> > > > > We just suggest to simplify the build of mxnet, by removing the
> > > > > unnecessary dependency.
> > > > >
> > > > > During that we discovered for example the following amazing issue:
> > > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Stas
> > > > >
> > > > > On 18.06.19, 18:24, "Chris Olivier" <cj...@gmail.com> wrote:
> > > > >
> > > > >     I am very reluctant to feed the trolls again, and this will be
> > teh
> > > > last
> > > > >     time I address Pedro or Anton on the subject, but since I think
> > the
> > > > > numbers
> > > > >     being presented are incorrect (either by te builders not really
> > > > >     understanding what they are building, or possibly intentional
> > > > > misdirection):
> > > > >
> > > > >     Turning Intel OMP on and off (and MKL as well, since it tends to
> > pull
> > > > > in
> > > > >     omp, depending which one is linked in).
> > > > >     There is a HUGE difference.  This is consistent with my
> > experience
> > > > > before
> > > > >     when it was added.
> > > > >
> > > > >
> > > > >     default mnist:
> > > > >
> > > > >     python ../example/image-classification/train_mnist.py
> > > > >     INFO:root:start with arguments Namespace(add_stn=False,
> > > > batch_size=64,
> > > > >     disp_batches=100, dtype='float32', gc_threshold=0.5,
> > gc_type='none',
> > > > >     gpus=None, image_shape='1, 28, 28', initializer='default',
> > > > >     kv_store='device', load_epoch=None, loss='', lr=0.05,
> > lr_factor=0.1,
> > > > >     lr_step_epochs='10', macrobatch_size=0, model_prefix=None,
> > mom=0.9,
> > > > >     monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > > > >     num_examples=60000, num_layers=None, optimizer='sgd',
> > > > >     profile_server_suffix='', profile_worker_suffix='',
> > save_period=1,
> > > > >     test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > > > > wd=0.0001)
> > > > >
> > > > >     INTEL OMP:
> > > > >
> > > > >     ldd libmxnet.so | grep omp
> > > > >             libomp.so =>
> > > > >
> >  /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > > > >     (0x00007f978fde7000)
> > > > >
> > > > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
> > > > >     accuracy=0.780012
> > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21
> > samples/sec
> > > > >     accuracy=0.920469
> > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91
> > samples/sec
> > > > >     accuracy=0.928281
> > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36
> > samples/sec
> > > > >     accuracy=0.942813
> > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79
> > samples/sec
> > > > >     accuracy=0.938750
> > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52
> > samples/sec
> > > > >     accuracy=0.946562
> > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41
> > samples/sec
> > > > >     accuracy=0.953281
> > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73
> > samples/sec
> > > > >     accuracy=0.951562
> > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88
> > samples/sec
> > > > >     accuracy=0.957500
> > > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > > > >     INFO:root:Epoch[0] Time cost=3.806
> > > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > > >     INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21
> > samples/sec
> > > > >     accuracy=0.968131
> > > > >     INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03
> > samples/sec
> > > > >     accuracy=0.966250
> > > > >
> > > > >
> > > > >     LIBGOMP:
> > > > >
> > > > >     ldd libmxnet.so | grep omp
> > > > >             libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > >     (0x00007f25c25dd000)
> > > > >
> > > > >     INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01
> > samples/sec
> > > > >      accuracy=0.782488
> > > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32
> > samples/sec
> > > > >      accuracy=0.907813
> > > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00
> > samples/sec
> > > > >      accuracy=0.927188
> > > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45
> > samples/sec
> > > > >      accuracy=0.937969
> > > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95
> > samples/sec
> > > > >      accuracy=0.942187
> > > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58
> > samples/sec
> > > > >      accuracy=0.950156
> > > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16
> > samples/sec
> > > > >      accuracy=0.947969
> > > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07
> > samples/sec
> > > > >      accuracy=0.953750
> > > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58
> > samples/sec
> > > > >     accuracy=0.953125
> > > > >
> > > > >     That being said, there's other issued beyond speed.  The DEFAULT
> > > > build
> > > > > from
> > > > >     makefile (not CMake) uses Intel OMP mkl (I showed before) and
> > > > > mysteriously
> > > > >     it has no issues?  This seems highly suspicious.  All I see is a
> > lot
> > > > of
> > > > >     hand-waving and conjecture and pointing to StackOverflow posts
> > made
> > > > by
> > > > >     people who may be of questionable pedigree to begin with.  This
> > > > smells
> > > > > of a
> > > > >     Pedro-ego-fight rather than one of purely technical merit.
> > Also, if
> > > > > one
> > > > >     knows how OMP works,  they would be very suspicious of the
> > > > > "intermittent
> > > > >     hangs" claim -- that's probably just broken race conditions
> > elsewhere
> > > > > until
> > > > >     proven differently.  It'd tend freeze on the first use if
> > something
> > > > is
> > > > >     wrong (try using libgomp after a fork and see), since worker
> > threads"
> > > > >     wouldn't be assigned/joined properly.  IntelOMP is faster, but
> > also
> > > > has
> > > > >     other advantages, such as allowing OMP after a fork.
> > > > >
> > > > >     I actually addressed a lot of issues and ask for clarification
> > in the
> > > > >     original PR's way back when, but they're all just ignored.
> > > > >
> > > > >     -Chris
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> >

Re: OMP

Posted by kellen sunderland <ke...@gmail.com>.
"if you’re linking in two then you’re doing something wrong." Correct,
that's one thing I believe we've got consensus on.  So let's call that out
as a bug to be fixed.

Let's move forward with some reproducible numbers and then discuss the pros
/ cons of which particular OMP implementation we should use.
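
One way to get apples-to-apples numbers without rebuilding for each runtime
is to preload a different OpenMP library over the same binary and time the
same script. A rough sketch (the library paths are assumptions; libiomp5
exposes GOMP-compatible entry points, so preloading it over a GOMP-built
libmxnet.so usually works, while the reverse generally does not, so treat
the numbers as indicative only):

    # Sketch: time train_mnist.py under two different preloaded OpenMP runtimes.
    import os, subprocess, time

    RUNTIMES = {
        "libgomp":  "/usr/lib/x86_64-linux-gnu/libgomp.so.1",  # path seen in ldd output above
        "libiomp5": "/opt/intel/lib/libiomp5.so",              # assumed install location
    }

    for name, lib in RUNTIMES.items():
        env = dict(os.environ, LD_PRELOAD=lib)
        start = time.time()
        subprocess.run(["python", "train_mnist.py"], env=env, check=True)
        print("%s: %.1f s" % (name, time.time() - start))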

On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Hi Chris
>
> I would ask you to have a bit of patience and help us with your
> experience in this matter. Nobody is ignoring anything, I think we are
> individually gathering feedbacks and trying to understand the multiple
> contributions done to this topic including yours, then go step by
> step, understand what is going on and run experiments and report back
> to the list or the corresponding github item. It was suggested by
> Kellen to prepare some containers, this takes effort.
>
> Regarding your final comment, most of us also have many other things
> to do and responsibilities even if our daytime jobs might involve
> MXNet in some form or another. I think that's part of the privilege
> and responsibility of working close with an open source project and
> the magic of collaboration across organizations. Let's all be patient
> and take some time to understand and reason about this topic which is
> not simple. Since we decided to step back and gather more data let's
> take time and do it properly.
>
> Personally I hope to find time to look again into this issue before
> the end of the week.
>
> Thanks.
>
> Pedro.
>
> On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <cj...@apache.org>
> wrote:
> >
> > if you’re linking in two then you’re doing something wrong. You can see
> by
> > my email yesterday that only one is linked in. This is also the case with
> > the mkl version built by the Makefile — only the Intel OMP library is
> used
> > (no libgomp).
> >
> > That being said, Do you have clear evidence that using Intel OMP is both
> > problematic and the situation isn’t fixable?  The burden of proof is on
> the
> > ones requesting the change — it is not my responsibility to justify the
> > current state.  There must be something “terrible” and unfixable to
> justify
> > a change.  I have seen no proof of this in all this time.
> >
> > On a side note, I mentioned a couple of things in my email yesterday that
> > still are not being responded to (they were also ignored in the last
> > incarnation of this “discussion” — I have much experience in this matter
> to
> > assume “discussion” is a waste of my time, seeing and I am not paid to
> > “work on” mxnet like y’all are).
> >
> > -C
> >
> >
> >
> >
> >
> >
> > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > kellen.sunderland@gmail.com> wrote:
> >
> > > I've also quite often seen two versions of OpenMP linked.  I think we
> can
> > > all agree we probably want to avoid linking in two libraries that do
> > > effectively the same thing.
> > >
> > > The performance questions should be fairly straight forward to
> demonstrate
> > > right?  Could we just collaborate on a few minimal Dockerfiles that
> show
> > > (or don't show) Intel OpenMP performance speedups with the workloads
> Chris
> > > is referencing?
> > >
> > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > stanislav.tsukrov@gmail.com> wrote:
> > >
> > > > Hi, Chris!
> > > >
> > > > Stas here - I've gathered that performance data.
> > > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > > missing.
> > > > Be assured, intentional misdirection was never a case.
> > > >
> > > > Thanks a lot for being constructive.
> > > >
> > > > > Turning Intel OMP on and off (and MKL as well, since it tends to
> pull
> > > in
> > > > omp, depending which one is linked in).
> > > >
> > > > We never ever considered turning MKL off. We are on the same page
> here -
> > > > MKL is crucial for the performance.
> > > > Why should we? There's a GOMP-linked version of MKL, that we can use.
> > > >
> > > > What we did - we measured, if using compilers default OpenMP
> > > > implementation instead of referenced source code distribution of
> OpenMP
> > > > makes anything slower.
> > > > We have found the impact to be hardly measurable.
> > > > The difference between GOMP and iOMP is <5% on our benchmarks, most
> of
> > > the
> > > > time less than that.
> > > >
> > > > We just suggest to simplify the build of mxnet, by removing the
> > > > unnecessary dependency.
> > > >
> > > > During that we discovered for example the following amazing issue:
> > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > >
> > > > Best Regards
> > > >
> > > > Stas
> > > >
> > > > On 18.06.19, 18:24, "Chris Olivier" <cj...@gmail.com> wrote:
> > > >
> > > >     I am very reluctant to feed the trolls again, and this will be
> teh
> > > last
> > > >     time I address Pedro or Anton on the subject, but since I think
> the
> > > > numbers
> > > >     being presented are incorrect (either by te builders not really
> > > >     understanding what they are building, or possibly intentional
> > > > misdirection):
> > > >
> > > >     Turning Intel OMP on and off (and MKL as well, since it tends to
> pull
> > > > in
> > > >     omp, depending which one is linked in).
> > > >     There is a HUGE difference.  This is consistent with my
> experience
> > > > before
> > > >     when it was added.
> > > >
> > > >
> > > >     default mnist:
> > > >
> > > >     python ../example/image-classification/train_mnist.py
> > > >     INFO:root:start with arguments Namespace(add_stn=False,
> > > batch_size=64,
> > > >     disp_batches=100, dtype='float32', gc_threshold=0.5,
> gc_type='none',
> > > >     gpus=None, image_shape='1, 28, 28', initializer='default',
> > > >     kv_store='device', load_epoch=None, loss='', lr=0.05,
> lr_factor=0.1,
> > > >     lr_step_epochs='10', macrobatch_size=0, model_prefix=None,
> mom=0.9,
> > > >     monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > > >     num_examples=60000, num_layers=None, optimizer='sgd',
> > > >     profile_server_suffix='', profile_worker_suffix='',
> save_period=1,
> > > >     test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > > > wd=0.0001)
> > > >
> > > >     INTEL OMP:
> > > >
> > > >     ldd libmxnet.so | grep omp
> > > >             libomp.so =>
> > > >
>  /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > > >     (0x00007f978fde7000)
> > > >
> > > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
> > > >     accuracy=0.780012
> > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21
> samples/sec
> > > >     accuracy=0.920469
> > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91
> samples/sec
> > > >     accuracy=0.928281
> > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36
> samples/sec
> > > >     accuracy=0.942813
> > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79
> samples/sec
> > > >     accuracy=0.938750
> > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52
> samples/sec
> > > >     accuracy=0.946562
> > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41
> samples/sec
> > > >     accuracy=0.953281
> > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73
> samples/sec
> > > >     accuracy=0.951562
> > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88
> samples/sec
> > > >     accuracy=0.957500
> > > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > > >     INFO:root:Epoch[0] Time cost=3.806
> > > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > > >     INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21
> samples/sec
> > > >     accuracy=0.968131
> > > >     INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03
> samples/sec
> > > >     accuracy=0.966250
> > > >
> > > >
> > > >     LIBGOMP:
> > > >
> > > >     ldd libmxnet.so | grep omp
> > > >             libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > >     (0x00007f25c25dd000)
> > > >
> > > >     INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01
> samples/sec
> > > >      accuracy=0.782488
> > > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32
> samples/sec
> > > >      accuracy=0.907813
> > > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00
> samples/sec
> > > >      accuracy=0.927188
> > > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45
> samples/sec
> > > >      accuracy=0.937969
> > > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95
> samples/sec
> > > >      accuracy=0.942187
> > > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58
> samples/sec
> > > >      accuracy=0.950156
> > > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16
> samples/sec
> > > >      accuracy=0.947969
> > > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07
> samples/sec
> > > >      accuracy=0.953750
> > > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58
> samples/sec
> > > >     accuracy=0.953125
> > > >
> > > >     That being said, there's other issued beyond speed.  The DEFAULT
> > > build
> > > > from
> > > >     makefile (not CMake) uses Intel OMP mkl (I showed before) and
> > > > mysteriously
> > > >     it has no issues?  This seems highly suspicious.  All I see is a
> lot
> > > of
> > > >     hand-waving and conjecture and pointing to StackOverflow posts
> made
> > > by
> > > >     people who may be of questionable pedigree to begin with.  This
> > > smells
> > > > of a
> > > >     Pedro-ego-fight rather than one of purely technical merit.
> Also, if
> > > > one
> > > >     knows how OMP works,  they would be very suspicious of the
> > > > "intermittent
> > > >     hangs" claim -- that's probably just broken race conditions
> elsewhere
> > > > until
> > > >     proven differently.  It'd tend freeze on the first use if
> something
> > > is
> > > >     wrong (try using libgomp after a fork and see), since worker
> threads"
> > > >     wouldn't be assigned/joined properly.  IntelOMP is faster, but
> also
> > > has
> > > >     other advantages, such as allowing OMP after a fork.
> > > >
> > > >     I actually addressed a lot of issues and ask for clarification
> in the
> > > >     original PR's way back when, but they're all just ignored.
> > > >
> > > >     -Chris
> > > >
> > > >
> > > >
> > > >
> > >
>

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
Hi Chris

I would ask you to have a bit of patience and help us with your
experience in this matter. Nobody is ignoring anything; I think we are
individually gathering feedback and trying to understand the multiple
contributions made to this topic, including yours, and then going step by
step: understanding what is going on, running experiments and reporting back
to the list or the corresponding GitHub item. It was suggested by
Kellen to prepare some containers, and this takes effort.

Regarding your final comment, most of us also have many other things
to do and responsibilities even if our daytime jobs might involve
MXNet in some form or another. I think that's part of the privilege
and responsibility of working closely with an open source project and
the magic of collaboration across organizations. Let's all be patient
and take some time to understand and reason about this topic which is
not simple. Since we decided to step back and gather more data let's
take time and do it properly.

Personally I hope to find time to look again into this issue before
the end of the week.

Thanks.

Pedro.

On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <cj...@apache.org> wrote:
>
> if you’re linking in two then you’re doing something wrong. You can see by
> my email yesterday that only one is linked in. This is also the case with
> the mkl version built by the Makefile — only the Intel OMP library is used
> (no libgomp).
>
> That being said, Do you have clear evidence that using Intel OMP is both
> problematic and the situation isn’t fixable?  The burden of proof is on the
> ones requesting the change — it is not my responsibility to justify the
> current state.  There must be something “terrible” and unfixable to justify
> a change.  I have seen no proof of this in all this time.
>
> On a side note, I mentioned a couple of things in my email yesterday that
> still are not being responded to (they were also ignored in the last
> incarnation of this “discussion” — I have much experience in this matter to
> assume “discussion” is a waste of my time, seeing and I am not paid to
> “work on” mxnet like y’all are).
>
> -C
>
>
>
>
>
>
> On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> kellen.sunderland@gmail.com> wrote:
>
> > I've also quite often seen two versions of OpenMP linked.  I think we can
> > all agree we probably want to avoid linking in two libraries that do
> > effectively the same thing.
> >
> > The performance questions should be fairly straight forward to demonstrate
> > right?  Could we just collaborate on a few minimal Dockerfiles that show
> > (or don't show) Intel OpenMP performance speedups with the workloads Chris
> > is referencing?
> >
> > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > stanislav.tsukrov@gmail.com> wrote:
> >
> > > Hi, Chris!
> > >
> > > Stas here - I've gathered that performance data.
> > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > missing.
> > > Be assured, intentional misdirection was never a case.
> > >
> > > Thanks a lot for being constructive.
> > >
> > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > in
> > > omp, depending which one is linked in).
> > >
> > > We never ever considered turning MKL off. We are on the same page here -
> > > MKL is crucial for the performance.
> > > Why should we? There's a GOMP-linked version of MKL, that we can use.
> > >
> > > What we did - we measured, if using compilers default OpenMP
> > > implementation instead of referenced source code distribution of OpenMP
> > > makes anything slower.
> > > We have found the impact to be hardly measurable.
> > > The difference between GOMP and iOMP is <5% on our benchmarks, most of
> > the
> > > time less than that.
> > >
> > > We just suggest to simplify the build of mxnet, by removing the
> > > unnecessary dependency.
> > >
> > > During that we discovered for example the following amazing issue:
> > > https://github.com/apache/incubator-mxnet/issues/14087
> > >
> > > Best Regards
> > >
> > > Stas
> > >
> > > On 18.06.19, 18:24, "Chris Olivier" <cj...@gmail.com> wrote:
> > >
> > >     I am very reluctant to feed the trolls again, and this will be teh
> > last
> > >     time I address Pedro or Anton on the subject, but since I think the
> > > numbers
> > >     being presented are incorrect (either by te builders not really
> > >     understanding what they are building, or possibly intentional
> > > misdirection):
> > >
> > >     Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > > in
> > >     omp, depending which one is linked in).
> > >     There is a HUGE difference.  This is consistent with my experience
> > > before
> > >     when it was added.
> > >
> > >
> > >     default mnist:
> > >
> > >     python ../example/image-classification/train_mnist.py
> > >     INFO:root:start with arguments Namespace(add_stn=False,
> > batch_size=64,
> > >     disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > >     gpus=None, image_shape='1, 28, 28', initializer='default',
> > >     kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> > >     lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> > >     monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > >     num_examples=60000, num_layers=None, optimizer='sgd',
> > >     profile_server_suffix='', profile_worker_suffix='', save_period=1,
> > >     test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > > wd=0.0001)
> > >
> > >     INTEL OMP:
> > >
> > >     ldd libmxnet.so | grep omp
> > >             libomp.so =>
> > >     /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > >     (0x00007f978fde7000)
> > >
> > >     :root:Epoch[0] Batch [0-100]        Speed: 31548.09 samples/sec
> > >     accuracy=0.780012
> > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 16073.21 samples/sec
> > >     accuracy=0.920469
> > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 19075.91 samples/sec
> > >     accuracy=0.928281
> > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 23211.36 samples/sec
> > >     accuracy=0.942813
> > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 22139.79 samples/sec
> > >     accuracy=0.938750
> > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 23225.52 samples/sec
> > >     accuracy=0.946562
> > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 19547.41 samples/sec
> > >     accuracy=0.953281
> > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 24111.73 samples/sec
> > >     accuracy=0.951562
> > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 13959.88 samples/sec
> > >     accuracy=0.957500
> > >     INFO:root:Epoch[0] Train-accuracy=0.925423
> > >     INFO:root:Epoch[0] Time cost=3.806
> > >     INFO:root:Epoch[0] Validation-accuracy=0.962580
> > >     INFO:root:Epoch[1] Batch [0-100]        Speed: 24560.21 samples/sec
> > >     accuracy=0.968131
> > >     INFO:root:Epoch[1] Batch [100-200]      Speed: 23457.03 samples/sec
> > >     accuracy=0.966250
> > >
> > >
> > >     LIBGOMP:
> > >
> > >     ldd libmxnet.so | grep omp
> > >             libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > >     (0x00007f25c25dd000)
> > >
> > >     INFO:root:Epoch[0] Batch [0-100]        Speed: 1731.01 samples/sec
> > >      accuracy=0.782488
> > >     INFO:root:Epoch[0] Batch [100-200]      Speed: 3551.32 samples/sec
> > >      accuracy=0.907813
> > >     INFO:root:Epoch[0] Batch [200-300]      Speed: 1991.00 samples/sec
> > >      accuracy=0.927188
> > >     INFO:root:Epoch[0] Batch [300-400]      Speed: 2175.45 samples/sec
> > >      accuracy=0.937969
> > >     INFO:root:Epoch[0] Batch [400-500]      Speed: 1644.95 samples/sec
> > >      accuracy=0.942187
> > >     INFO:root:Epoch[0] Batch [500-600]      Speed: 6444.58 samples/sec
> > >      accuracy=0.950156
> > >     INFO:root:Epoch[0] Batch [600-700]      Speed: 7842.16 samples/sec
> > >      accuracy=0.947969
> > >     INFO:root:Epoch[0] Batch [700-800]      Speed: 9412.07 samples/sec
> > >      accuracy=0.953750
> > >     INFO:root:Epoch[0] Batch [800-900]      Speed: 12707.58 samples/sec
> > >     accuracy=0.953125
> > >
> > >     That being said, there's other issued beyond speed.  The DEFAULT
> > build
> > > from
> > >     makefile (not CMake) uses Intel OMP mkl (I showed before) and
> > > mysteriously
> > >     it has no issues?  This seems highly suspicious.  All I see is a lot
> > of
> > >     hand-waving and conjecture and pointing to StackOverflow posts made
> > by
> > >     people who may be of questionable pedigree to begin with.  This
> > smells
> > > of a
> > >     Pedro-ego-fight rather than one of purely technical merit.  Also, if
> > > one
> > >     knows how OMP works,  they would be very suspicious of the
> > > "intermittent
> > >     hangs" claim -- that's probably just broken race conditions elsewhere
> > > until
> > >     proven differently.  It'd tend freeze on the first use if something
> > is
> > >     wrong (try using libgomp after a fork and see), since worker threads"
> > >     wouldn't be assigned/joined properly.  IntelOMP is faster, but also
> > has
> > >     other advantages, such as allowing OMP after a fork.
> > >
> > >     I actually addressed a lot of issues and ask for clarification in the
> > >     original PR's way back when, but they're all just ignored.
> > >
> > >     -Chris
> > >
> > >
> > >
> > >
> >

Re: OMP

Posted by Chris Olivier <cj...@apache.org>.
if you’re linking in two then you’re doing something wrong. You can see by
my email yesterday that only one is linked in. This is also the case with
the mkl version built by the Makefile — only the Intel OMP library is used
(no libgomp).

That being said, Do you have clear evidence that using Intel OMP is both
problematic and the situation isn’t fixable?  The burden of proof is on the
ones requesting the change — it is not my responsibility to justify the
current state.  There must be something “terrible” and unfixable to justify
a change.  I have seen no proof of this in all this time.

On a side note, I mentioned a couple of things in my email yesterday that
still are not being responded to (they were also ignored in the last
incarnation of this “discussion” — I have much experience in this matter to
assume “discussion” is a waste of my time, seeing as I am not paid to
“work on” mxnet like y’all are).

-C






On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> I've also quite often seen two versions of OpenMP linked.  I think we can
> all agree we probably want to avoid linking in two libraries that do
> effectively the same thing.
>
> The performance questions should be fairly straight forward to demonstrate
> right?  Could we just collaborate on a few minimal Dockerfiles that show
> (or don't show) Intel OpenMP performance speedups with the workloads Chris
> is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> stanislav.tsukrov@gmail.com> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data.
> > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > missing.
> > Be assured, intentional misdirection was never a case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> in
> > omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off. We are on the same page here -
> > MKL is crucial for the performance.
> > Why should we? There's a GOMP-linked version of MKL, that we can use.
> >
> > What we did - we measured, if using compilers default OpenMP
> > implementation instead of referenced source code distribution of OpenMP
> > makes anything slower.
> > We have found the impact to be hardly measurable.
> > The difference between GOMP and iOMP is <5% on our benchmarks, most of
> the
> > time less than that.
> >
> > We just suggest to simplify the build of mxnet, by removing the
> > unnecessary dependency.
> >
> > During that we discovered for example the following amazing issue:
> > https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
+1. It would be best to have a controlled environment so we can reason
about how MXNet is being built and what libraries are linked. I'm
happy to help here. I would think Docker won't have a big impact on
the measurement or distort the results much.


On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland
<ke...@gmail.com> wrote:
>
> I've also quite often seen two versions of OpenMP linked.  I think we can
> all agree we probably want to avoid linking in two libraries that do
> effectively the same thing.
>
> The performance questions should be fairly straight forward to demonstrate
> right?  Could we just collaborate on a few minimal Dockerfiles that show
> (or don't show) Intel OpenMP performance speedups with the workloads Chris
> is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> stanislav.tsukrov@gmail.com> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data.
> > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > missing.
> > Be assured, intentional misdirection was never a case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> > omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off. We are on the same page here -
> > MKL is crucial for the performance.
> > Why should we? There's a GOMP-linked version of MKL, that we can use.
> >
> > What we did - we measured, if using compilers default OpenMP
> > implementation instead of referenced source code distribution of OpenMP
> > makes anything slower.
> > We have found the impact to be hardly measurable.
> > The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> > time less than that.
> >
> > We just suggest to simplify the build of mxnet, by removing the
> > unnecessary dependency.
> >
> > During that we discovered for example the following amazing issue:
> > https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >

Re: OMP

Posted by kellen sunderland <ke...@gmail.com>.
I've also quite often seen two versions of OpenMP linked.  I think we can
all agree we probably want to avoid linking in two libraries that do
effectively the same thing.

The performance questions should be fairly straightforward to demonstrate,
right?  Could we just collaborate on a few minimal Dockerfiles that show
(or don't show) Intel OpenMP performance speedups with the workloads Chris
is referencing?
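
As a strawman for the shared workload, something as small as this could go
into each image (just a sketch against the MXNet 1.x Python NDArray API;
the shapes and iteration counts are arbitrary):

import time
import mxnet as mx

# MLP-shaped CPU workload, so the only variable between runs is which
# OpenMP runtime the image's libmxnet.so was linked against.
ctx = mx.cpu()
batch, in_dim, hidden, classes = 64, 784, 512, 10
x = mx.nd.random.uniform(shape=(batch, in_dim), ctx=ctx)
w1 = mx.nd.random.uniform(shape=(in_dim, hidden), ctx=ctx)
w2 = mx.nd.random.uniform(shape=(hidden, classes), ctx=ctx)

def step():
    # Two dense layers; dot() is where the OpenMP-backed math happens.
    h = mx.nd.relu(mx.nd.dot(x, w1))
    return mx.nd.dot(h, w2)

for _ in range(10):  # warm-up
    step()
mx.nd.waitall()

iters = 1000
start = time.time()
for _ in range(iters):
    step()
mx.nd.waitall()  # wait for the async engine to finish before timing stops
print("%.1f samples/sec" % (iters * batch / (time.time() - start)))

Running the exact same script in the libgomp image and the Intel/LLVM OpenMP
image would give directly comparable samples/sec numbers.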

On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
stanislav.tsukrov@gmail.com> wrote:

> Hi, Chris!
>
> Stas here - I've gathered that performance data.
> Sure thing, I can be wrong, but please elaborate a bit on what we are
> missing.
> Be assured, intentional misdirection was never a case.
>
> Thanks a lot for being constructive.
>
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> omp, depending which one is linked in).
>
> We never ever considered turning MKL off. We are on the same page here -
> MKL is crucial for the performance.
> Why should we? There's a GOMP-linked version of MKL, that we can use.
>
> What we did - we measured, if using compilers default OpenMP
> implementation instead of referenced source code distribution of OpenMP
> makes anything slower.
> We have found the impact to be hardly measurable.
> The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> time less than that.
>
> We just suggest to simplify the build of mxnet, by removing the
> unnecessary dependency.
>
> During that we discovered for example the following amazing issue:
> https://github.com/apache/incubator-mxnet/issues/14087
>
> Best Regards
>
> Stas
>

Re: OMP

Posted by "Tsukrov, Stanislav" <st...@gmail.com>.
Hi, Chris!

Stas here - I've gathered that performance data.
Sure thing, I can be wrong, but please elaborate a bit on what we are missing.
Be assured, intentional misdirection was never the case.

Thanks a lot for being constructive. 

> Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, depending which one is linked in).

We never ever considered turning MKL off. We are on the same page here - MKL is crucial for performance.
Why should we? There's a GOMP-linked version of MKL that we can use.

What we did was measure whether using the compiler's default OpenMP implementation, instead of the referenced source-code distribution of OpenMP, makes anything slower.
We have found the impact to be hardly measurable. 
The difference between GOMP and iOMP is <5% on our benchmarks, most of the time less than that. 

We just suggest simplifying the build of mxnet by removing the unnecessary dependency.

During that work we discovered, for example, the following amazing issue:
https://github.com/apache/incubator-mxnet/issues/14087

Best Regards

Stas

Re: OMP

Posted by Pedro Larroy <pe...@gmail.com>.
First of all, thanks for following up on this topic and not sweeping
the problem under the rug. You might very well be right and have some
numbers which corroborate your findings; that might be something to
celebrate. Before continuing our technical discussion I would like to
take a step back and remind you of the code of conduct, since I think
the way you are handling the communication about this issue is not
conducive to a healthy community. It is also not a good leadership
example from a respected engineer and Apache PMC member.

We are all trying to do the best we can for the project, and not
everyone is an expert on everything. There are technical decisions
made long ago, sometimes lacking proper documentation and
justification, which, even if they are right, constitute technical debt,
as it takes a big effort to reverse-engineer or deep-dive to
understand all the non-obvious ramifications. I called a
vote to clarify the issue and to have an opportunity to move forward on a
long-standing problem that remains unaddressed and unclear; this is not
trolling, and it is nothing personal against anyone or their work.

I actually know just the basics about OpenMP, so this is hardly about
ego; it's also not my contribution. I tried to help by providing
some of the requested benchmarks, since I felt the original contributors
had given up trying to help. After we provided info and benchmarks one
after another, you closed the PR in a way that was not well understood.

If there's a flaw in the benchmark you are right to point it out.  If
someone doesn't have the time or willingness to coach contributors,
properly explain why a PR is not doing the right thing, or document
their technical contributions in a way that we can all align behind and
understand the tradeoffs, they shouldn't be exercising the power to
close PRs.

Please take some time to read the code of conduct:

https://www.apache.org/foundation/policies/conduct

There's also other materials about building healthy communities:
https://www.jonobacon.com/books/artofcommunity/

Since we don't all share your particular sense of humor, I would
suggest being prudent and polite, being patient when explaining your
technical decisions, refraining from name-calling and ad hominem
arguments, and assuming good intentions.

I suggested to you before, in a private channel, to have your findings
and benchmarks documented in the wiki so we can have constructive
conversations and help contributors improve the existing issues with
OpenMP. People come and go in projects, so you can't assume that
everyone knows the reasons why something was done a certain way two
years ago; the reasons might also change with time.


Pedro.


Re: OMP

Posted by Per da Silva <pe...@gmail.com>.
Hi Chris,

It's not clear why you think the numbers are wrong. It seems Stas has put
a lot of effort into performing the benchmarks and comprehensively writing
down the methodology and results. Of course, no one is above making mistakes.
Therefore, it would be great if you could shine some light on what you find
objectionable and maybe add some suggestions for experiments or
improvements. Perhaps you could try to rerun the benchmarks yourself and
reach out if there are any steps that are missing or unclear.

I work with Stas; he's a very talented engineer and his integrity is
above reproach. So, you don't need to fear any "political" motivations
behind his effort. I feel this level of antagonism doesn't help the
community at all. Perhaps we could keep the conversation focused on the
methodology and the results so we can bring this story to a conclusion.


Cheers,

Per
