Posted to dev@accumulo.apache.org by Corey Nolet <cj...@gmail.com> on 2014/09/02 05:17:39 UTC

Re: Tablet server thrift issue

As an update,

I raised the tablet server memory and I have not seen this error thrown
since. I'd like to say raising the memory, alone, was the solution but it
appears that I also may be having some performance issues with the switches
connecting the racks together. I'll update more as I dive in further.
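
In case it helps anyone hitting the same failure, here is a minimal sketch of
the catch-and-recreate workaround Josh describes in the quoted thread: keep
your own copy of every mutation added since the last successful flush(), and
on a failure throw the writer away, build a new one, and replay that copy.
MutationSink, SinkException, ReplayingWriter, and FlakySink are hypothetical
names made up for this sketch. They are stand-ins, not the Accumulo client
API, and SinkException is unchecked only to keep the example short. Replaying
can apply a mutation more than once, so this assumes that is harmless for
your table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a BatchWriter; NOT the Accumulo API. */
interface MutationSink {
    void add(String mutation);
    void flush();
}

/** Hypothetical stand-in for MutationsRejectedException (unchecked here for brevity). */
class SinkException extends RuntimeException {
    SinkException(String message) { super(message); }
}

/** Keeps a copy of everything added since the last good flush and replays it on failure. */
class ReplayingWriter {
    private final Supplier<MutationSink> factory; // builds a fresh writer
    private final int maxRebuilds;                // fresh writers to try per failure
    private final List<String> unflushed = new ArrayList<>();
    private MutationSink sink;

    ReplayingWriter(Supplier<MutationSink> factory, int maxRebuilds) {
        this.factory = factory;
        this.maxRebuilds = maxRebuilds;
        this.sink = factory.get();
    }

    synchronized void add(String mutation) {
        unflushed.add(mutation); // save our own copy before handing it over
        try {
            sink.add(mutation);
        } catch (SinkException e) {
            recover(e);
        }
    }

    synchronized void flush() {
        try {
            sink.flush();
            unflushed.clear(); // only now are these known to be durable
        } catch (SinkException e) {
            recover(e);
        }
    }

    /** Discards the failed writer, then replays the saved mutations on a new one. */
    private void recover(SinkException cause) {
        SinkException last = cause;
        for (int i = 0; i < maxRebuilds; i++) {
            sink = factory.get();
            try {
                for (String m : unflushed) {
                    sink.add(m);
                }
                sink.flush();
                unflushed.clear();
                return;
            } catch (SinkException e) {
                last = e;
            }
        }
        throw last; // give up and surface the most recent failure
    }
}

/** In-memory sink for trying the sketch: a flush fails while failuresLeft > 0. */
class FlakySink implements MutationSink {
    static int failuresLeft = 0;
    static final List<String> durable = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean broken = false;

    public void add(String mutation) {
        if (broken) throw new SinkException("writer already failed");
        pending.add(mutation);
    }

    public void flush() {
        if (broken) throw new SinkException("writer already failed");
        if (failuresLeft > 0) {
            failuresLeft--;
            broken = true; // like the batch writer, it stays failed from here on
            throw new SinkException("simulated server error");
        }
        durable.addAll(pending);
        pending.clear();
    }
}
```

With the real client, the factory would presumably wrap a call such as
Connector.createBatchWriter(...), and the saved copies would be Mutation
objects rather than strings. The cost is the double bookkeeping I asked about
in the thread: one copy in the writer's buffer and one in my own.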


On Fri, Aug 22, 2014 at 11:41 PM, Corey Nolet <cj...@gmail.com> wrote:

> Josh,
>
> Your advice is definitely useful. I also thought about catching the
> exception and retrying with a fresh batch writer, but the fact that the
> batch writer failure doesn't go away without being re-instantiated is
> really only a nuisance. The TabletServerBatchWriter could be designed much
> better, I agree, but that is not the root of the problem.
>
> The Thrift exception that is causing the issue is what I'd like to get to
> the bottom of. It's throwing the following:
>
> TApplicationException: applyUpdates failed: out of sequence response
>
> I've never seen this exception before in regular use of the client API,
> but I also just updated to 1.6.0. Google isn't showing anything useful for
> how exactly this exception could come about other than using a bad
> threading model, and I don't see any drastic changes or other user
> complaints on the mailing list that would validate that line of thought.
> Quite frankly, I'm stumped. This could be a Thrift exception related to a
> Thrift bug or something bad on my system and have nothing to do with
> Accumulo.
>
> Chris Tubbs mentioned to me earlier that he recalled Keith and Eric had
> seen the exception before and may remember what it was/how they fixed it.
>
>
> On Fri, Aug 22, 2014 at 10:58 PM, Josh Elser <jo...@gmail.com> wrote:
>
>> Don't mean to tell you that I don't think there might be a bug/otherwise,
>> that's pretty much just the limit of what I know about the server-side
>> sessions :)
>>
>> If you have concrete "this worked in 1.4.4" and "this happens instead
>> with 1.6.0", that'd make a great ticket :D
>>
>> The BatchWriter failure case is pretty rough, actually. Eric has made
>> some changes to help already (in 1.6.1, I think), but it needs an overhaul
>> that I haven't been able to make time to fix properly, either. IIRC, the
>> only guarantee you have is that all mutations added before the last flush()
>> happened are durable on the server. Anything else is a guess. I don't know
>> the specifics, but that should be enough to work with (and saving off
>> mutations shouldn't be too costly since they're stored serialized).
>>
>>
>> On 8/22/14, 5:44 PM, Corey Nolet wrote:
>>
>>> Thanks Josh,
>>>
>>> I understand about the session ID completely but the problem I have is
>>> that the exact same client code worked, line for line, just fine in 1.4.4
>>> and it's acting up in 1.6.0. I also seem to remember the BatchWriter
>>> automatically creating a new session when one expired without an
>>> exception causing it to fail on the client.
>>>
>>> I know we've made changes since 1.4.4 but I'd like to troubleshoot the
>>> actual issue of the BatchWriter failing due to the thrift exception
>>> rather than just catching the exception and trying mutations again. The
>>> other issue is that I've already submitted a bunch of mutations to the
>>> batch writer from different threads. Does that mean I need to be storing
>>> them off twice? (once in the BatchWriter's cache and once in my own)
>>>
>>> The BatchWriter in my ingester is constantly sending data and the tablet
>>> servers have been given more than enough memory to be able to keep up.
>>> There's no swap being used and the network isn't experiencing any errors.
>>>
>>>
>>> On Fri, Aug 22, 2014 at 4:54 PM, Josh Elser <jo...@gmail.com> wrote:
>>>
>>>> If you get an error from a BatchWriter, you pretty much have to throw
>>>> away that instance of the BatchWriter and make a new one. See
>>>> ACCUMULO-2990. If you want, you should be able to catch/recover from
>>>> this without having to restart the ingester.
>>>>
>>>> If the session ID is invalid, my guess is that it hasn't been used
>>>> recently and the tserver cleaned it up. The exception logic isn't the
>>>> greatest (as it just is presented to you as a RTE).
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-2990
>>>>
>>>>
>>>> On 8/22/14, 4:35 PM, Corey Nolet wrote:
>>>>
>>>>> Eric & Keith, Chris mentioned to me that you guys have seen this issue
>>>>> before. Any ideas from anyone else are much appreciated as well.
>>>>>
>>>>> I recently updated a project's dependencies to Accumulo 1.6.0 built
>>>>> with Hadoop 2.3.0. I've got CDH 5.0.2 deployed. The project has an
>>>>> ingest component which is running all the time with a batch writer
>>>>> using many threads to push mutations into Accumulo.
>>>>>
>>>>> The issue I'm having is a show stopper. At different intervals of time,
>>>>> sometimes an hour, sometimes 30 minutes, I'm getting
>>>>> MutationsRejectedExceptions (server errors) from the
>>>>> TabletServerBatchWriter. Once they start, I need to restart the
>>>>> ingester to get them to stop. They always come back within 30 minutes
>>>>> to an hour... rinse, repeat.
>>>>>
>>>>> The exception always happens on different tablet servers. It's a thrift
>>>>> error saying a message was received out of sequence. In the
>>>>> TabletServer logs, I see an "Invalid session id" exception which
>>>>> happens only once before the client-side batch writer starts spitting
>>>>> out the MREs.
>>>>>
>>>>> I'm running some heavyweight processing in Storm alongside the tablet
>>>>> servers. I shut that processing off in hopes that maybe it was the
>>>>> culprit but that hasn't fixed the issue.
>>>>>
>>>>> I'm surprised I haven't seen any other posts on the topic.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>
>