You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@thrift.apache.org by "Marc Kwiatkowski (JIRA)" <ji...@apache.org> on 2009/05/06 09:32:30 UTC

[jira] Commented: (THRIFT-488) ThreadManager crashing bugs

    [ https://issues.apache.org/jira/browse/THRIFT-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706325#action_12706325 ] 

Marc Kwiatkowski commented on THRIFT-488:
-----------------------------------------

Rush:

I need to go through this in detail but I'm not sure this is the right approach to the problem.  I wonder if it makes more sense to wrap the Worker threads' references to ThreadManager with a boost:weak_ptr/boost::shared_ptr.  Currently the ThreadManager dtor calls stop(false) ie don't block for worker threads.  My original thinking was that if the client didn't want the ThreadManager to hang around waiting for potentially stuck worker threads, it shouldn't, but as you note that leads to references to dtor'ed objects.

Switching to weak_ptr allows ThreadManager's dtor to keep the best effort stop and still let active threads clean up properly after the primary reference to ThreadManager is dropped.

Let me know what you think.

> ThreadManager crashing bugs
> ---------------------------
>
>                 Key: THRIFT-488
>                 URL: https://issues.apache.org/jira/browse/THRIFT-488
>             Project: Thrift
>          Issue Type: Bug
>          Components: Library (C++)
>    Affects Versions: 0.1
>         Environment: Mac OS X 10.5.6, Xcode
>            Reporter: Rush Manbert
>         Attachments: ThreadManager.cpp.patch
>
>
> The ThreadManager::Impl and the ThreadManager::Worker classes work together to execute client threads. There are race conditions between the two classes that can cause Bus Error exceptions in certain cases. In general, these occur when a ThreadManager instance is destructed, so might not be seen in long running programs. They happen frequently enough, though, that looped repetitions of ThreadManagerTests::blockTest() (part of the concurrency_test program) fail quite often.
> These errors are generally not seen with the current version of ThreadManagerTests::blockTest() due to errors in the test itself that cause failures at a far higher frequency. In order to see them, you need to apply the patches that are attached to THRIFT-487 (https://issues.apache.org/jira/browse/THRIFT-487).
> Test procedure:
> 1) Apply the patch from THRIFT-487 for the Tests.cpp file.
> 2) Run make in lib/cpp in order to rebuild concurrency_test
> 3) Run concurrency_test with the command line argument "thread-manager" and observe that the test hangs in no time.
> 4) Apply the patch from THRIFT-487 for the ThreadManagerTests.h file.
> 5) Run make in lib/cpp
> 6) Run concurrency_test as before. Observe that now it runs for longer (generally) and usually fails with an assert in Monitor.cpp. This failure is because of one of the bugs in ThreadManager.
> 7) Apply the attached patch file for ThreadManager.cpp
> 8) Run make in lib/cpp
> 9) Run concurrency_test as before. It should just run, and run, and run.
> Note that there is a possible path through the original ThreadManager::Worker::run() method where active never becomes true. In practice, exercising this code path is difficuly. The way that I exercised it was to edit line 322 in the patched version of ThreadManager.cpp. I changed the for statement to read:
> for (size_t ix = 0; ix < value + 1; ix++)
> so that the ThreadManager always created more workers than were needed. That extra worker caused quite a bit of trouble until I moved his handling up to the top of the run() method. I don't understand how this situation could occur in real life, but this code appears to handle it correctly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.