Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/09/17 11:45:00 UTC

[GitHub] [incubator-mxnet] Lizonghang opened a new issue #16186: Segmentation fault when calling "auto &updates = update_buf_[key]; "

URL: https://github.com/apache/incubator-mxnet/issues/16186
 
 
   ## Description
   Hello, I split the standard `DataHandleDefault` interface in kvstore_dist_server.h into `DataHandleSyncDefault` and `DataHandleAsyncDefault` to support synchronous and asynchronous mode respectively. The `DataHandleSyncDefault` interface works fine, but the `DataHandleAsyncDefault` interface hits a segmentation fault at `auto &updates = update_buf_[key];`. I wonder what may cause the fault and how to fix it, thanks!
   
   ## Environment info (Required)
   
   The following is the code of the `DataHandleAsyncDefault` interface; most of it is similar to the original `DataHandleDefault` interface:
   
   ```
   void DataHandleAsyncDefault(const DataHandleType type, const ps::KVMeta& req_meta,
                                                       const ps::KVPairs<char> &req_data,
                                                       ps::KVServer<char>* server) {
       // do some check
       CHECK_EQ(req_data.keys.size(), (size_t)1);
       if (req_meta.push) {
         CHECK_EQ(req_data.lens.size(), (size_t)1);
         CHECK_EQ(req_data.vals.size(), (size_t)req_data.lens[0]);
       }
       CHECK(ps::IsGlobalServer() && !sync_global_mode_);
       int key = DecodeKey(req_data.keys[0]);
       auto& stored = has_multi_precision_copy(type) ? store_realt_[key] : store_[key];
       if (req_meta.push) {
         size_t ds[] = {(size_t) req_data.lens[0] / mshadow::mshadow_sizeof(type.dtype)};
         TShape dshape(ds, ds + 1);
         TBlob recv_blob;
         MSHADOW_REAL_TYPE_SWITCH(type.dtype, DType, {
           recv_blob = TBlob(reinterpret_cast<DType*>(req_data.vals.data()), dshape, cpu::kDevMask);
         });
         NDArray recved = NDArray(recv_blob, 0);
         if (stored.is_none()) {
           // initialization by master worker
           stored = NDArray(dshape, Context(), false,
                            has_multi_precision_copy(type) ? mshadow::kFloat32 : type.dtype);
           CopyFromTo(recved, &stored, 0);
           if (has_multi_precision_copy(type)) {
             auto &stored_dtype = store_[key];
             stored_dtype = NDArray(dshape, Context(), false, type.dtype);
             CopyFromTo(stored, stored_dtype);
             stored_dtype.WaitToRead();
           }
           stored.WaitToRead();
           server->Response(req_meta);
           auto len = stored.shape().Size() * mshadow::mshadow_sizeof(stored.dtype());
           ps::KVPairs<char>broadcast_data;
           broadcast_data.keys.push_back(req_data.keys[0]);
           broadcast_data.lens = {len};
           broadcast_data.vals.CopyFrom(static_cast<const char*>(stored.data().dptr_), len);
           server->Broadcast(broadcast_data);
         } else {
           auto &updates = update_buf_[key];
           if (has_multi_precision_copy(type) && updates.temp_array.is_none()) {
             updates.temp_array = NDArray(dshape, Context(), false, mshadow::kFloat32);
           }
           if (updates.request.empty()) {
             if (has_multi_precision_copy(type)) {
               CopyFromTo(recved, updates.temp_array);
             } else {
               updates.temp_array = recved;
             }
           updates.request.push_back(req_meta);
           ApplyUpdates(type, key, req_data, &updates, server);
         }
       }
     }
   }
   
   inline void ApplyUpdates(const DataHandleType type, const int key,
                                              const ps::KVPairs<char>& req_data, UpdateBuf *update_buf,
                                              ps::KVServer<char>* server) {
       auto& stored = has_multi_precision_copy(type) ? store_realt_[key] : store_[key];
       auto& update = update_buf->temp_array;
       exec_.Exec([this, key, &update, &stored](){
         CHECK(updater_);
         updater_(key, update, &stored);
       });
       for (const auto& req : update_buf->request) {
         DefaultStorageResponse(type, key, req, req_data, server, true);
       }
       update_buf->request.clear();
       if (has_multi_precision_copy(type)) CopyFromTo(stored, store_[key]);
       stored.WaitToRead();
   }
   ```
   
   INITIALIZATION works fine; the segmentation fault occurs in the PUSH stage when calling `auto &updates = update_buf_[key];`. Strangely, the fault usually occurs in the 2nd or 3rd training round (after `DataHandleAsyncDefault` has been called about 60 times), and sometimes it does not occur at all.
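
For context on why that line can fault at all: `operator[]` on a `std::unordered_map` is not a pure lookup. If the key is absent it inserts a default-constructed entry, which may also rehash the table, so it counts as a write to the container. The minimal sketch below illustrates this with a hypothetical `UpdateBufSketch` type standing in for the server's `UpdateBuf`; the helper names are made up for illustration and are not part of MXNet.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the server's UpdateBuf; only its shape matters here.
struct UpdateBufSketch {
  std::vector<int> request;
};

// Returns how many entries the map gained from a lookup via operator[].
std::size_t EntriesAddedByIndexing(std::unordered_map<int, UpdateBufSketch>* buf,
                                   int key) {
  const std::size_t before = buf->size();
  auto& updates = (*buf)[key];  // inserts a default-constructed entry if key is absent
  (void)updates;
  return buf->size() - before;
}

// Returns how many entries the map gained from a lookup via find().
std::size_t EntriesAddedByFind(const std::unordered_map<int, UpdateBufSketch>& buf,
                               int key) {
  const std::size_t before = buf.size();
  auto it = buf.find(key);  // read-only: never inserts
  (void)it;
  return buf.size() - before;
}
```

Because the first `update_buf_[key]` for a given key modifies the map, calling it from two threads without synchronization would be a data race, whereas the same call on a single thread is safe.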
   
   ## Error Message:
   I ran the server process under gdb (in a Docker container with 24GB memory, 12 CPUs, 6GB swap, and unlimited shm size), and gdb reported the following:
   
   ```
   (gdb) ......
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff78911700 (LWP 22644)]
   0x00007fffd7cddb2f in std::__detail::_Map_base<int, std::pair<int const, mxnet::kvstore::KVStoreDistServer::UpdateBuf>, std::allocator<std::pair<int const, mxnet::kvstore::KVStoreDistServer::UpdateBuf> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](int const&) () from /root/HiPS/lib/libmxnet.so
   (gdb) bt
   #0  0x00007fffd7cddb2f in std::__detail::_Map_base<int, std::pair<int const, mxnet::kvstore::KVStoreDistServer::UpdateBuf>, std::allocator<std::pair<int const, mxnet::kvstore::KVStoreDistServer::UpdateBuf> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](int const&) () from /root/HiPS/lib/libmxnet.so
   #1  0x00007fffd7d14060 in mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType, ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7d16d31 in mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () from /root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d53cfc in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556a462b8) at /usr/include/c++/5/functional:2267
   #5  ps::Customer::Receiving (this=0x555556a462b0) at src/customer.cc:62
   #6  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat (__p=<optimized out>)
       at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #7  0x00007ffff7bc16ba in start_thread (arg=0x7fff78911700) at pthread_create.c:333
   #8  0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   Warning: the current language does not match this frame.
   ```
   
   ## What have you tried to solve it?
   
   1. I tried discarding `update_buf_` and applying `recved` directly to update `store_`, but a similar fault occurred when calling `auto& stored = has_multi_precision_copy(type) ? store_realt_[key] : store_[key];`.
   2. I suspected Docker might be causing the faults and tried physical machines, but the faults still occur.
   3. I found that other faults sometimes occur as well (listed below), and it looks like something is wrong with memory allocation. I have spent 5 days trying to find the cause without success.
   
   a) `updates.request.push_back(req_meta);`
   ```
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff7390f700 (LWP 23473)]
   0x00007fffd7cd34e2 in std::vector<ps::KVMeta, std::allocator<ps::KVMeta> >::push_back(ps::KVMeta const&) ()
      from /root/HiPS/lib/libmxnet.so
   (gdb) bt
   #0  0x00007fffd7cd34e2 in std::vector<ps::KVMeta, std::allocator<ps::KVMeta> >::push_back(ps::KVMeta const&) ()
      from /root/HiPS/lib/libmxnet.so
   #1  0x00007fffd7d14d30 in mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType, ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7d16d31 in mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () from /root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d53cfc in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556b63e38) at /usr/include/c++/5/functional:2267
   #5  ps::Customer::Receiving (this=0x555556b63e30) at src/customer.cc:62
   #6  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat (__p=<optimized out>)
       at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #7  0x00007ffff7bc16ba in start_thread (arg=0x7fff7390f700) at pthread_create.c:333
   #8  0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   ```
   
   b) `updates.temp_array = recved;`
   ```
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff7390b700 (LWP 23591)]
   0x00007fffd4a301d3 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /root/HiPS/lib/libmxnet.so
   (gdb) bt
   #0  0x00007fffd4a301d3 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /root/HiPS/lib/libmxnet.so
   #1  0x00007fffd747d6e5 in mxnet::NDArray::operator=(mxnet::NDArray const&) () from /root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7d14e68 in mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType, ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d16d31 in mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () from /root/HiPS/lib/libmxnet.so
   #5  0x00007fffd7d53cfc in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556a462b8) at /usr/include/c++/5/functional:2267
   #6  ps::Customer::Receiving (this=0x555556a462b0) at src/customer.cc:62
   #7  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat (__p=<optimized out>)
       at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #8  0x00007ffff7bc16ba in start_thread (arg=0x7fff7390b700) at pthread_create.c:333
   #9  0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   ```
   
   c) `DefaultStorageResponse(type, key, req, req_data, server, true);`
   ```
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff78911700 (LWP 24459)]
   __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
   238	../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
   (gdb) bt
   #0  __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
   #1  0x00007fffd7cf94f1 in mxnet::kvstore::KVStoreDistServer::DefaultStorageResponse(mxnet::kvstore::DataHandleType, int, ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*, bool) () from /root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7cf97e7 in mxnet::kvstore::KVStoreDistServer::ApplyUpdates(mxnet::kvstore::DataHandleType, int, ps::KVPairs<char> const&, mxnet::kvstore::KVStoreDistServer::UpdateBuf*, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d14d9f in mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType, ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d16d31 in mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #5  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () from /root/HiPS/lib/libmxnet.so
   #6  0x00007fffd7d53cfc in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556a462b8) at /usr/include/c++/5/functional:2267
   #7  ps::Customer::Receiving (this=0x555556a462b0) at src/customer.cc:62
   #8  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat (__p=<optimized out>)
       at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #9  0x00007ffff7bc16ba in start_thread (arg=0x7fff78911700) at pthread_create.c:333
   #10 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   ```
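
All four traces crash on a `ps::Customer::Receiving` thread, inside container or reference-count operations that are normally safe on a single thread. One hypothesis consistent with that pattern, which the traces do not confirm, is that more than one receiving thread reaches the same handler and touches `update_buf_` or `store_` concurrently. The sketch below is a minimal, self-contained illustration of serializing such access with a mutex. `UpdateBufSketch`, `GuardedUpdateBuffers`, and `HammerFromThreads` are hypothetical names invented here, not MXNet or ps-lite APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the server's UpdateBuf.
struct UpdateBufSketch {
  std::vector<int> request;
};

// Hypothetical wrapper: every access to the map goes through one mutex.
class GuardedUpdateBuffers {
 public:
  // Appends a request id to the buffer for `key`, creating the buffer if needed.
  void PushRequest(int key, int request_id) {
    std::lock_guard<std::mutex> lock(mu_);
    buffers_[key].request.push_back(request_id);
  }

  // Returns the number of requests buffered for `key` (0 if the key is absent).
  std::size_t PendingCount(int key) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = buffers_.find(key);
    return it == buffers_.end() ? 0 : it->second.request.size();
  }

 private:
  std::mutex mu_;
  std::unordered_map<int, UpdateBufSketch> buffers_;
};

// Spawns `num_threads` threads that each push `per_thread` requests,
// spread round-robin over `num_keys` keys, then waits for all of them.
void HammerFromThreads(GuardedUpdateBuffers* buffers, int num_threads,
                       int per_thread, int num_keys) {
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([buffers, t, per_thread, num_keys]() {
      for (int i = 0; i < per_thread; ++i) {
        buffers->PushRequest(i % num_keys, t * per_thread + i);
      }
    });
  }
  for (auto& w : workers) w.join();
}
```

If the hypothesis holds, removing the lock in a sketch like this would be expected to reproduce intermittent crashes of the same kind; building the real server with ThreadSanitizer (`-fsanitize=thread`) would be one way to check whether such a race actually exists.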
   
   I am quite stuck at this point; many thanks for your help.
