You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@accumulo.apache.org by ShawnWalker <gi...@git.apache.org> on 2016/04/15 16:48:28 UTC

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

GitHub user ShawnWalker opened a pull request:

    https://github.com/apache/accumulo/pull/94

    ACCUMULO-4191 Tracing on client can sometimes lose "sendMutations" events.

    Wrapped a `Runnable` with `Trace.wrap(...)` in `TabletServerBatchWriter`.  This should allow proper tracing propagation across the binning thread pool, and consequently prevent loss of "sendMutation" events.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ShawnWalker/accumulo ACCUMULO-4191

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/94.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #94
    
----
commit 2407283e43613290225084737036b9ae8a76bc98
Author: Shawn Walker <ac...@shawn-walker.net>
Date:   2016-04-15T14:38:37Z

    ACCUMULO-4191 Wrapped runnable with Trace.wrap in TabletServerBatchWriter to prevent loss of sendMutation events

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210903451
  
    > Curiously, ShellServerIT#trace() is failing for me on 1.7 and master, but not 1.6.
    
    Aha, 1.7 and master have the htrace dependency. I wonder if we worked around something in cloudtrace.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/94#discussion_r59886456
  
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchWriter.java ---
    @@ -733,7 +733,7 @@ else if (Tables.getTableState(context.getInstance(), table) == TableState.OFFLIN
         void queueMutations(final MutationSet mutationsToSend) throws InterruptedException {
           if (null == mutationsToSend)
             return;
    -      binningThreadPool.execute(new Runnable() {
    +      binningThreadPool.execute(Trace.wrap(new Runnable() {
    --- End diff --
    
    There's actually a TraceRunnable class which would do the same thing (perhaps a bit more concisely)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by ShawnWalker <gi...@git.apache.org>.
Github user ShawnWalker commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210540256
  
    > ... the "lower" half of the span would be un-rooted (the part doing binning and sending).
    Is "unrooted" different from "unreported"?  My view of the tracing message suggests so, but I'm uncertain as to exactly the difference.
    
    > Have you been able to notice your changes positively affecting ShellServerIT#trace's success rate? I have seen it fail now and again, but never reliably.
    
    For me, `ShellServerIT.trace()` _always_ fails.  I've been banging my head against this for the past 3 days, trying to get tests to succeed while working on ACCUMULO-4187.  I'll take your word that it doesn't fail for everyone all the time.  With this change, `ShellServerIT` has succeeded for me the 5 or 6 times I've run it.
    
    > Any idea as to why we only sometimes see ShellServerIT fail? Is there a reason why this doesn't always fail that you noticed?
    
    My current theory is that it's a race condition depending on how long it takes to spin up a new threadpool:
    
    At line `TabletServerBatchWriter` line 670, `binningThreadPool` uses a `SynchronousQueue`, and at line 671, the `binningThreadPool` has a `CallerRunsPolicy`.  Together these mean "perform binning in the other thread, but only if it's not currently busy doing binning; otherwise just do the binning in the current thread."
    
    If the binning is done in the current thread, there is no need to propagate tracing information; as you point out, this information is carried by thread-local variables.  If the binning ends up being done in the thread pool, the tracing information needs to be propagated.
    
    But there's a race condition here: The `SynchronousQueue` will only allow a job to be pushed into it if there's already a thread sitting waiting on a job.  So the race ends up being which happens first: the consumer thread created to service `binningThreadPool` is ready to accept work, or the producer thread is ready to commit some of its mutations.  In the former case, the work gets done in the thread pool; in the latter case, the work gets done in the current thread.
    
    > Any thoughts on how we could make a test which would specifically check for regressions here?
    
    Race conditions are notably difficult to diagnose.  What's more, this particular race condition isn't an error (assuming the tracing is properly handled regardless of who wins the race).  I can think of a few ways one might better expose this particular bug under lab conditions, but any such test would by necessity be timing sensitive, and I do so hate knowingly writing timing-sensitive tests.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210902049
  
    Curiously, `ShellServerIT#trace()` is failing for me on 1.7 and master, but not 1.6. Looking into that before I apply these changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/accumulo/pull/94


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210903933
  
    >  I wonder if we worked around something in cloudtrace
    
    Yeeeeeep. In `NamingThreadFactory#newThread(Runnable)`.
    
    ```
    return new Daemon(new LoggingRunnable(log, new TraceRunnable(r)), name + " " + threadNum.getAndIncrement());
    ```
    
    I'll apply this to 1.7 and master. Thanks again for the detailed analysis, @ShawnWalker! Great work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by billierinaldi <gi...@git.apache.org>.
Github user billierinaldi commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210527065
  
    This change looks reasonable to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/94#discussion_r59888276
  
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchWriter.java ---
    @@ -733,7 +733,7 @@ else if (Tables.getTableState(context.getInstance(), table) == TableState.OFFLIN
         void queueMutations(final MutationSet mutationsToSend) throws InterruptedException {
           if (null == mutationsToSend)
             return;
    -      binningThreadPool.execute(new Runnable() {
    +      binningThreadPool.execute(Trace.wrap(new Runnable() {
    --- End diff --
    
    Oh, that's lame. You're totally right -- I thought the former is how it worked. Maybe I'm remembering pre-HTrace? What you have sounds great then :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210545040
  
    > For me, ShellServerIT.trace() always fails. I've been banging my head against this for the past 3 days, trying to get tests to succeed while working on ACCUMULO-4187. I'll take your word that it doesn't fail for everyone all the time. With this change, ShellServerIT has succeeded for me the 5 or 6 times I've run it.
    
    Sorry about this. @billierinaldi had pointed out to me that she thinks ACCUMULO-1755 had broken this. I didn't know that this one was broken. If our tests are not reliably passing, that's wrong/bad. Give us a shout before bashing your head next time :)
    
    > Race conditions are notably difficult to diagnose.
    
    Yep, that's why I asked :).
    
    >  I can think of a few ways one might better expose this particular bug under lab conditions, but any such test would by necessity be timing sensitive, and I do so hate knowingly writing timing-sensitive tests.
    
    Yep, totally not asking you to do so. I was looking for some understanding that you 1. put thought into whether or not this is directly test-able and 2. you had a plausible idea of why this was sometimes happening. Sounds fine to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by joshelser <gi...@git.apache.org>.
Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210495111
  
    > This should allow proper tracing propagation across the binning thread pool, and consequently prevent loss of "sendMutation" events.
    > In TabletServerBatchWriter.MutationWriter.queueMutations(...) (line 736), HTrace Span's are not properly propagated across thread boundaries. Consequently, tracing doesn't propagate into TabletServerBatchWriter.SendTask.send(...), and so the sendMutation event can fail to be logged.
    
    So, IIRC, Tracer maintains the current Span via a ThreadLocal. So, I could see spawning a new Thread in the binning process might lose that, and the "lower" half of the span would be un-rooted (the part doing binning and sending).
    
    A couple of questions:
    * Any thoughts on how we could make a test which would specifically check for regressions here?
    * Have you been able to notice your changes positively affecting `ShellServerIT#trace`'s success rate? I have seen it fail now and again, but never reliably.
    * Any idea as to why we only *sometimes* see ShellServerIT fail? Is there a reason why this doesn't always fail that you noticed?
    
    Thanks!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by ShawnWalker <gi...@git.apache.org>.
Github user ShawnWalker commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210550520
  
    @joshelser 
    > I was looking for some understanding that you ... had a plausible idea of why this was sometimes happening. Sounds fine to me.
    
    It seemed obvious to me that not propagating the tracing was an error.  Before you asked, I had just decided it was "some random race condition".  Coming up with a plausible explanation made me read what was going on much more closely.
    
    If this really is broken for everyone on master, there's a much more plausible explanation why it might have occasionally failed before:  the `TimerTask` created at `TabletServerBatchWriter` line 210 can cause mutations to be sent without propagating the tracing information.  Particularly the fact that it has an initial delay of 0 would make for a timing sensitive race condition:  either {{close()}} arranges to send the mutations (traceable) or the timer arranges to send the mutations (untraceable).
    
    Fixing the timer task to propagate tracing would not be quite so trivial a task as the patch proposed in this pull request.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by ShawnWalker <gi...@git.apache.org>.
Github user ShawnWalker commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/94#discussion_r59887691
  
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TabletServerBatchWriter.java ---
    @@ -733,7 +733,7 @@ else if (Tables.getTableState(context.getInstance(), table) == TableState.OFFLIN
         void queueMutations(final MutationSet mutationsToSend) throws InterruptedException {
           if (null == mutationsToSend)
             return;
    -      binningThreadPool.execute(new Runnable() {
    +      binningThreadPool.execute(Trace.wrap(new Runnable() {
    --- End diff --
    
    Looking at `org.apache.htrace.wrappers.TraceRunnable`, it's not really designed to be used as you're suggesting.  One would hope to be able to write:
    ```java
    threadPool.execute(new TraceRunnable() { ... })
    ```
    
    But the best that `TraceRunnable` allows is:
    ```java
    threadPool.execute(new TraceRunnable(new Runnable() {...}));
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request: ACCUMULO-4191 Tracing on client can sometim...

Posted by madrob <gi...@git.apache.org>.
Github user madrob commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210531758
  
    Might be better to wrap the whole pool instead of each individual runnable using `tracer.newTraceExecutorService(...)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---