Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/11 02:59:52 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #12339: ARROW-14908: [C++][R] Dataset hash join segfaults on Windows [WIP]

westonpace commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r804328939



##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
   }
 
   void InitLocalStateIfNeeded(size_t thread_index) {
+    DCHECK_LT(thread_index, local_states_.size());
     ThreadLocalState& local_state = local_states_[thread_index];

Review comment:
   > Even with Sys.setenv(OMP_THREAD_LIMIT = "1") this still occurs.
   
   That isn't too surprising.  `use_threads` triggers an entirely different code path in some places, so it is not fully equivalent to setting `OMP_THREAD_LIMIT = "1"`.
   
   > I also tried writing a C++ unit test that did a join after a dataset scan, but I couldn't reproduce the problem. That leads me to think there may be some issue with how the R bindings are configuring things, but it could also be I just didn't reproduce it quite well enough.
   
   How consistent is the R error?
   
   > Despite use_threads = FALSE, it seems like there are quite a few threads spawned by the engine. While I'm learning, I'm just not familiar enough to know which parts seem weird.
   
   `use_threads` generally does not control the I/O thread pool (which defaults to 8 threads and is not controlled by `OMP_THREAD_LIMIT`).  If someone were really passionate about shoving everything onto the calling thread, there is a way to do this, but it would be quite a bit of work.
   
   In addition, jemalloc (if compiled in) will spawn some background cleanup threads.
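
As a rough illustration of what the `DCHECK_LT(thread_index, local_states_.size())` line in the hunk above guards against, here is a minimal standalone sketch. It is not Arrow's code: `LocalState` and `PerThreadStates` are hypothetical names, and it assumes the per-thread state container is sized up front for a fixed number of threads. It is not a diagnosis of the segfault; it only shows how an index coming from a thread the container was not sized for can be rejected instead of being used to index out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-thread join state (not Arrow's ThreadLocalState).
struct LocalState {
  bool is_initialized = false;
};

// Hypothetical holder that is sized once for a known number of threads.
class PerThreadStates {
 public:
  explicit PerThreadStates(std::size_t num_threads) : states_(num_threads) {}

  // Lazily initializes the state for `thread_index`.
  // Returns nullptr when the index is outside the preallocated range,
  // which is the condition a debug-only bounds assertion would flag.
  LocalState* InitIfNeeded(std::size_t thread_index) {
    if (thread_index >= states_.size()) {
      return nullptr;
    }
    LocalState& state = states_[thread_index];
    if (!state.is_initialized) {
      state.is_initialized = true;
    }
    return &state;
  }

  std::size_t size() const { return states_.size(); }

 private:
  std::vector<LocalState> states_;
};
```

With a container sized for four threads, `InitIfNeeded(3)` returns a valid state, while `InitIfNeeded(7)` returns a null pointer rather than touching memory past the end of the vector. A debug-only assertion trades that runtime branch for a hard failure in debug builds, which is usually the more useful behaviour while hunting a crash.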



