Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/18 19:17:37 UTC

[GitHub] [arrow] michalursa commented on pull request #10845: ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join

michalursa commented on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-901366200


   I looked into the failure in unit tests and here is what is happening.
   
   The InputReceived() method of the semi-join exec node uses the ThreadIndexer class to map the current thread to an integer between 0 and the thread pool's Capacity() - 1. There are Capacity() local states available to threads, so it is important that ThreadIndexer never returns an index greater than Capacity() - 1.
   
   ThreadIndexer allocates one more index for every distinct thread that uses it. In the semi-join, those are the threads that call InputReceived() on its exec node. In the failing unit test Capacity() is N, but N+1 threads call InputReceived(). N of them come from the thread pool passed to the generators used by the source array scan exec nodes for both the build and probe sides of the join. The extra thread is the one that executes the unit test; it does not belong to that thread pool.
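
   To make the off-by-one concrete, here is a minimal sketch of the situation described above. SketchThreadIndexer and LargestIndexSeen are hypothetical names invented for this illustration; this is not Arrow's actual ThreadIndexer, only an assumption about its observable behavior (one new index per distinct calling thread). Plain std::thread workers stand in for the generators' thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for ThreadIndexer (not Arrow's implementation):
// each distinct calling thread gets the next unused index 0, 1, 2, ...
class SketchThreadIndexer {
 public:
  std::size_t operator()() {
    std::lock_guard<std::mutex> lock(mutex_);
    const std::thread::id id = std::this_thread::get_id();
    auto it = index_of_.find(id);
    if (it != index_of_.end()) return it->second;
    const std::size_t index = index_of_.size();
    index_of_.emplace(id, index);
    return index;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::thread::id, std::size_t> index_of_;
};

// Runs `capacity` worker threads (standing in for the generators' thread
// pool) and then the calling thread (standing in for the unit test thread)
// through one indexer, and returns the largest index handed out.
std::size_t LargestIndexSeen(std::size_t capacity) {
  assert(capacity > 0);
  SketchThreadIndexer indexer;
  std::vector<std::size_t> seen(capacity);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < capacity; ++i) {
    // Each worker writes only its own slot, so no extra locking is needed.
    workers.emplace_back([&indexer, &seen, i] { seen[i] = indexer(); });
  }
  for (auto& worker : workers) worker.join();
  std::size_t largest = *std::max_element(seen.begin(), seen.end());
  // The extra, non-pool caller is the (capacity + 1)-th distinct thread.
  largest = std::max(largest, indexer());
  return largest;
}
```

   With a capacity of 4, LargestIndexSeen(4) returns 4, which is one past the last valid slot in an array of 4 local states.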
   
   How is this possible? In the failing case, the exec plan uses the default execution context, whose thread pool pointer is set to null. A null thread pool signals that the plan is meant to run single-threaded; a plan intended for parallel execution should not have it set to null. Scan nodes are bound to generators, and the generators use the thread pools given to them (independently of the exec plan's execution context) to execute a multi-threaded scan. In a parallel scan, scan nodes are supposed to transfer tasks from the generator thread pool to the exec plan thread pool. Here, however, a parallel scan meets a serial plan: there is no thread pool to transfer to, and therefore the original thread is used (I assume). The result is a mix of parallel (from the scan's generators) and single-threaded (from the plan) execution logic. This situation is not supposed to happen: a parallel scan should be used with a parallel execution plan.

