You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/15 00:48:50 UTC

[GitHub] [arrow] westonpace opened a new pull request, #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

westonpace opened a new pull request, #12894:
URL: https://github.com/apache/arrow/pull/12894

I identified and reproduced two possible ways this sort of segmentation fault could happen. The stack traces demonstrated that worker tasks were still running for a plan after the test case had considered the plan "finished" and moved on.

First, the test case was calling gtest asserts in a helper method called from a loop:

```
void RunPlan(parameters) {
Plan plan = MakePlan(parameters);
ASSERT_TRUE(plan.FinishesInAReasonableTime());
}
void Test() {
// ...
for (int i = 0; i < kNumTrials; i++) {
RunPlan(parameters);
}
}
```

If the plan was sometimes timing out then the assert could be triggered. A gtest assert simply returns immediately but it would then get swept up into the next iteration of the loop. I changed the helper method to return a `Result` and put all asserts in the test case. That being said, I don't think this was the likely failure as I would expect to have seen instances of this test case timing out along with instances where it had a segmentation fault.

The second possibility was a rather unique set of circumstances that I was only able to trigger reliably when inserting sleeps into the test at just the right spots.

Basically, the node has three task groups, `BuildHashTable`, `ProbeQueuedBatches`, and `ScanHashTable`. It is possible for `ProbeQueuedBatches` to have zero tasks. This means, when `StartTaskGroup` is called on the probe task group it will immediately finish and call the finish continuation. The finish continuation could then call `StartTaskGroup` on the scan hash table task. If the scan hash table task finished quickly then it is possible it would trigger the finished callback of the exec node before the call to `StartTaskGroup->OnTaskGroupFinished` for the *probe* task group finishes returning. This particular call returned `all_task_groups_finished=false` because it was the *probe* task group and the final task group was the scan task group. As a result it would try and call `this->ScheduleMore` (still inside `StartTaskGroup`) but by this point `this` was deleted. Actually, given the stack traces we have, it looks like the call to `ScheduleMore` started, which makes sens
e as it wasn't a virtual call, but the state of `this` was invalid).

I spent some time trying to figure out how to fix `TaskScheduler` when I realized we already have a convenient fix for this problem. I added an `AsyncTaskGroup` at the node level to ensure that all thread tasks started by the node finish before the node is marked finished.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org