Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/15 00:48:50 UTC

[GitHub] [arrow] westonpace opened a new pull request, #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

westonpace opened a new pull request, #12894:
URL: https://github.com/apache/arrow/pull/12894

   I identified and reproduced two possible ways this sort of segmentation fault could happen.  The stack traces demonstrated that worker tasks were still running for a plan after the test case had considered the plan "finished" and moved on.
   
   First, the test case was calling gtest asserts in a helper method called from a loop:
   
   ```
   void RunPlan(parameters) {
     Plan plan = MakePlan(parameters);
     // On failure, ASSERT_* returns only from RunPlan; the loop in Test() keeps going.
     ASSERT_TRUE(plan.FinishesInAReasonableTime());
   }
   void Test() {
     // ...
     for (int i = 0; i < kNumTrials; i++) {
       RunPlan(parameters);
     }
   }
   ```
   
   If the plan was sometimes timing out then the assert could be triggered.  A gtest assert simply returns from the function it appears in, so a failure in the helper would not stop the test; the loop would just carry on into the next iteration.  I changed the helper method to return a `Result` and put all asserts in the test case.  That being said, I don't think this was the likely failure, as I would expect to have seen instances of this test case timing out along with instances where it had a segmentation fault.
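
As a rough illustration of that change (a sketch, not the code in this PR): `Status`, `RunPlan`, and the two `RunTrials...` functions below are hypothetical stand-ins, and the timeout is simulated with a `fail_at` index.  The first loop drops the helper's outcome, much like an assert that only exits the helper; the second checks the returned status in the caller and stops at the first failure.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a status/result type; this is not the Arrow API.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical plan runner: the trial whose index equals fail_at "times out".
Status RunPlan(int trial, int fail_at) {
  if (trial == fail_at) {
    return {false, "plan did not finish in a reasonable time"};
  }
  return {true, ""};
}

// Old shape: the failure never reaches the caller, so every trial is started.
int RunTrialsIgnoringFailure(int num_trials, int fail_at) {
  int started = 0;
  for (int i = 0; i < num_trials; i++) {
    ++started;
    RunPlan(i, fail_at);  // outcome dropped, like an assert that only exits the helper
  }
  return started;
}

// New shape: the caller inspects the status and stops at the first failure.
int RunTrialsCheckingStatus(int num_trials, int fail_at) {
  int started = 0;
  for (int i = 0; i < num_trials; i++) {
    ++started;
    Status st = RunPlan(i, fail_at);
    if (!st.ok) break;  // in a gtest test body this would be the assert itself
  }
  return started;
}
```

With five trials and a failure on trial 1, the first variant still starts all five plans, while the second stops after starting two.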
   
   The second possibility was a rather unusual set of circumstances that I was only able to trigger reliably by inserting sleeps into the test at just the right spots.
   
   Basically, the node has three task groups: `BuildHashTable`, `ProbeQueuedBatches`, and `ScanHashTable`.  It is possible for `ProbeQueuedBatches` to have zero tasks.  In that case, when `StartTaskGroup` is called on the probe task group, the group finishes immediately and calls its finish continuation.  The finish continuation can then call `StartTaskGroup` on the scan hash table task group.  If the scan hash table task group finishes quickly, it can trigger the finished callback of the exec node before the `StartTaskGroup->OnTaskGroupFinished` call for the *probe* task group has returned.  That call returns `all_task_groups_finished=false` because it was made for the *probe* task group and the final task group is the scan task group.  As a result, it then tries to call `this->ScheduleMore` (still inside `StartTaskGroup`), but by this point `this` has been deleted.  (Actually, given the stack traces we have, it looks like the call to `ScheduleMore` started, which makes sense as it wasn't a virtual call, but the state of `this` was invalid.)
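
The ordering described above can be modeled with a small single-threaded sketch.  `ToyScheduler`, `Register`, and `SimulateEmptyProbeGroup` are hypothetical names, not Arrow's `TaskScheduler`; tasks run inline here instead of on worker threads, and nothing is actually deleted.  The only point is to show that the node's "finished" event can be recorded while the outer `StartTaskGroup` call for the empty probe group still has work left to do.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, single-threaded model of a task scheduler (assumption for illustration).
struct ToyScheduler {
  struct Group {
    int num_tasks;
    std::function<void()> on_finished;
  };
  std::vector<Group> groups;
  std::vector<std::string> log;

  int Register(int num_tasks, std::function<void()> on_finished) {
    groups.push_back(Group{num_tasks, std::move(on_finished)});
    return static_cast<int>(groups.size()) - 1;
  }

  void StartTaskGroup(int id) {
    const std::string name = std::to_string(id);
    log.push_back("start " + name);
    for (int i = 0; i < groups[id].num_tasks; i++) {
      log.push_back("task " + name);  // tasks run inline in this model
    }
    groups[id].on_finished();  // fires immediately when the group has zero tasks
    // Stand-in for the work done after the finish continuation returns
    // (the ScheduleMore call in the description above).
    log.push_back("after finish " + name);
  }
};

// Group 0 plays the empty probe group; group 1 plays the scan group.
std::vector<std::string> SimulateEmptyProbeGroup() {
  ToyScheduler s;
  int scan = -1;
  int probe = s.Register(0, [&s, &scan] { s.StartTaskGroup(scan); });
  scan = s.Register(1, [&s] { s.log.push_back("node finished"); });
  s.StartTaskGroup(probe);
  return s.log;
}
```

In the resulting log, "node finished" appears before "after finish 0".  In the real code that trailing step touched `this` after the node could already have been torn down.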
   
   I spent some time trying to figure out how to fix `TaskScheduler` before I realized we already have a convenient fix for this problem.  I added an `AsyncTaskGroup` at the node level to ensure that all thread tasks started by the node finish before the node is marked finished.
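
The general idea of waiting for every started task before signalling completion can be sketched as follows.  `ToyTaskGroup` and `DemoFinishCounts` are hypothetical; this is not Arrow's `AsyncTaskGroup` and does not mirror its API.  It is only a minimal counter-based tracker, assuming each task calls `TaskFinished` as its very last action.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical tracker (assumption for illustration, not Arrow's AsyncTaskGroup).
class ToyTaskGroup {
 public:
  explicit ToyTaskGroup(std::function<void()> on_all_done)
      : on_all_done_(std::move(on_all_done)) {}

  // Call before launching a task.
  void TaskStarted() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++outstanding_;
  }

  // Call as the last action of a task.
  void TaskFinished() {
    std::function<void()> callback;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --outstanding_;
      callback = TakeCallbackIfDone();
    }
    // Invoked from a local copy, outside the lock; no members are touched afterwards.
    if (callback) callback();
  }

  // Call once no more tasks will be started.
  void End() {
    std::function<void()> callback;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ended_ = true;
      callback = TakeCallbackIfDone();
    }
    if (callback) callback();
  }

 private:
  // Caller must hold mutex_.  Hands out the callback at most once.
  std::function<void()> TakeCallbackIfDone() {
    if (ended_ && outstanding_ == 0 && on_all_done_) {
      return std::exchange(on_all_done_, nullptr);
    }
    return nullptr;
  }

  std::mutex mutex_;
  int outstanding_ = 0;
  bool ended_ = false;
  std::function<void()> on_all_done_;
};

// Records how many times the "finished" callback has fired after each step.
std::vector<int> DemoFinishCounts() {
  int finished = 0;
  ToyTaskGroup group([&finished] { ++finished; });
  std::vector<int> seen;
  group.TaskStarted();
  group.TaskStarted();
  group.End();
  seen.push_back(finished);  // two tasks still outstanding
  group.TaskFinished();
  seen.push_back(finished);  // one task still outstanding
  group.TaskFinished();
  seen.push_back(finished);  // last task done, callback fired
  return seen;
}
```

The callback is moved into a local before it runs so that the tracker touches none of its own state afterwards, which is the property the crash described above was missing.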


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1099776818

   Epic detective effort! Does this fix thread sanitizer too? 




[GitHub] [arrow] save-buffer commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1104628351

   Is this ready to be merged? I'd like to rebase my bloom filter PR on it. 




[GitHub] [arrow] westonpace commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1099826784

   I wasn't getting TSAN errors on this test case (this is before bloom filter)




[GitHub] [arrow] github-actions[bot] commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1099748054

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] github-actions[bot] commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1099748044

   https://issues.apache.org/jira/browse/ARROW-14911




[GitHub] [arrow] westonpace commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1099748302

   CC @michalursa PTAL




[GitHub] [arrow] westonpace closed pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed
URL: https://github.com/apache/arrow/pull/12894




[GitHub] [arrow] ursabot commented on pull request #12894: ARROW-14911: [C++] arrow-compute-hash-join-node-test failed

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12894:
URL: https://github.com/apache/arrow/pull/12894#issuecomment-1107371867

   Benchmark runs are scheduled for baseline = b9952840be6ff7234b416b5b80a48ecd7a5ecf60 and contender = 4f08a9b6d0f1249f3f3246167e18360da52a6f0d. 4f08a9b6d0f1249f3f3246167e18360da52a6f0d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e08900e5f1724cac97cb54e6e841d920...c2985d33043949998de77c5a2b94a057/)
   [Finished :arrow_down:0.91% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7bb37ad2bf634508b85cfb0cffd7454c...065cf6a190fd497988f302078c3cefee/)
   [Failed :arrow_down:0.38% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/01b0c2008451469b98f4cf9c6f0f0fba...790e57bcf69749e8ab9e1a6e83ce6dd1/)
   [Finished :arrow_down:0.25% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/86b73d22f35349c889c0b62aed140e4d...20ba331f0dd24ab8996f30f01335d3ef/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/564| `4f08a9b6` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/552| `4f08a9b6` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/550| `4f08a9b6` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/562| `4f08a9b6` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/563| `b9952840` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/551| `b9952840` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/549| `b9952840` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/561| `b9952840` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   

